Open MPI Signal Bus Error
Subject: [OMPI users] Bus Error (7) on PS3 running HPL (OpenMPI 1.2.8)
From: Hoelzlwimmer Andreas - S0810595005 (S0810595005_at_[hidden])
Date: 2009-08-04 11:10:18

Hello,

I wanted to run MPI on a couple of PS3s here. According to a colleague who set it up, I had to set up several HugePages. As the PS3's RAM is limited, I had to allocate 2 HugePages. I first ran HPL with the following command (taken from a tutorial):

    mpirun --mca btl_openib_want_fork_support 0 -np 1 numactl --physcpubind=0 ./xhpl : -np 1 numactl --physcpubind=1 ./xhpl

As I had very little memory, I had to disable some services, and did so (Wifi service, Bluetooth, printing, and other unneeded services).
After running the same command again, I got an error message (see below). Can
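Since the post above mentions allocating only 2 HugePages, one thing worth checking before launching is how many huge pages are actually free, because touching a huge-page-backed mapping when none are left is one known way to get a bus error on Linux. The following is a minimal sketch, not a diagnosis of the poster's problem: the function names and the SAMPLE text are made up for illustration, and only the HugePages_Total, HugePages_Free and Hugepagesize fields of /proc/meminfo are assumed.

```python
def parse_hugepages(meminfo_text):
    """Extract the huge page counters from /proc/meminfo-style text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in wanted:
            # Hugepagesize carries a unit ("16384 kB"); the counters do not.
            info[key.strip()] = int(rest.split()[0])
    return info


def read_hugepages(path="/proc/meminfo"):
    """Read the counters from the running Linux system."""
    with open(path) as f:
        return parse_hugepages(f.read())


# Illustrative values only, not taken from the original post.
SAMPLE = """MemTotal:         220000 kB
HugePages_Total:       2
HugePages_Free:        0
Hugepagesize:      16384 kB
"""

if __name__ == "__main__":
    counters = parse_hugepages(SAMPLE)
    if counters.get("HugePages_Free", 0) == 0:
        print("no free huge pages:", counters)
```

On a real node, read_hugepages() would replace the SAMPLE text; a HugePages_Free of 0 before mpirun starts would be a hint to look at the huge page setup first.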
found that after upgrading from v1.1.4 to v1.2, jobs using np > 4 would fail to start during MPI_Init, due to what appears to be a lack of space in /tmp. The error output is:

-----
[tpb200:32193] *** Process received signal ***
[tpb200:32193] Signal: Bus error (7)
[tpb200:32193] Signal code: (2)
[tpb200:32193] Failing at address: 0x2a998f4120
[tpb200:32193] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f6e430]
[tpb200:32193] [ 1] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_grow+0x138) [0x2a9568abc8]
[tpb200:32193] [ 2] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_free_list_resize+0x2d) [0x2a9568b0dd]
[tpb200:32193] [ 3] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x6bf) [0x2a98ba419f]
[tpb200:32193] [ 4] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x28a) [0x2a9899a4fa]
[tpb200:32193] [ 5] /opt/openmpi/1.2.gcc3/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe8) [0x2a98889308]
[tpb200:32193] [ 6] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(ompi_mpi_init+0x45d) [0x2a956a32ed]
[tpb200:32193] [ 7] /opt/openmpi/1.2.gcc3/lib/libmpi.so.0(MPI_Init+0x93) [0x2a956c5c93]
[tpb200:32193] [ 8] a.out(main+0x1c) [0x400a44]
[tpb200:32193] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a960933fb]
[tpb200:32193] [10] a.out [0x40099a]
[tpb200:32193] *** End of error message ***
... lots of the above for each process ...
mpirun noticed that job rank 0 with PID 32040 on node tpb200 exited on signal 7 (Bus error).
--/--

If I increase the size of my ramdisk or point $TMP to a network filesystem then jobs start and complete fine, so it's not a showstopper, but with v1.1.4 (or LAM v7.1.2) I didn't encounter this issue with my default 1m ramdisk (even with np > 100).

Is there a way to limit /tmp usage in Open MPI v1.2?

Hugh

Ralph Castain 2007-03-20 20:37:31 UTC

One option would be to amend your mpirun command with -mca btl ^sm. This turns off the shared memory subsystem, so you'll see some performance loss in your collectives. However, it will reduce you
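The two workarounds discussed in this exchange (giving /tmp more room, or excluding the sm BTL) can be sketched as a small pre-launch check. This is only an illustration under stated assumptions: tmp_has_room and mpirun_without_sm are hypothetical helper names, and the 64 MiB threshold is an arbitrary placeholder, since the thread does not say how much space the shared memory backing file needs. The only mpirun option used is the -mca btl ^sm form quoted in the reply.

```python
import shutil


def tmp_has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


def mpirun_without_sm(np, program):
    """Build an mpirun argv that excludes the sm BTL, as suggested in the reply."""
    return ["mpirun", "-mca", "btl", "^sm", "-np", str(np), program]


if __name__ == "__main__":
    # Placeholder threshold; the real requirement is not stated in the thread.
    needed = 64 * 1024 * 1024
    if tmp_has_room("/tmp", needed):
        argv = ["mpirun", "-np", "8", "./a.out"]
    else:
        argv = mpirun_without_sm(8, "./a.out")
    print(" ".join(argv))
```

Building the command as an argument list (for example to hand to subprocess.run) sidesteps any question of how a particular shell treats the ^ character.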
MPI Errors

Some students have reported getting errors when running MPICH programs. The error message usually states that there are no free ports in the MPICH_PORT_RANGE.

No Free Ports in MPICH_PORT_RANGE

Cause: The errors are usually caused by program crashes which do not free the sockets available to the system. Hence over successive runs of mpiexec or mpirun the ports all become used up (leaving none free to start new programs).

Fix: Type:

    killall mpd
    killall python2.3

in a normal command line window. This should kill all of the running MPICH daemons - note that this will kill running MPI programs as well (not just the 'zombie' ones).

Signal 11 or Signal 9

Cause: Usually caused by incorrect use of pointers or by indexing past the end of an array. The technical cause of Signal 11 is usually a segmentation fault - i.e. a pointer pointed to a location in memory outside of the program's address space. Signal 9 is SIGKILL: the process was killed from outside rather than by its own fault, which under MPI typically happens when the runtime tears down the remaining processes after one of them has crashed.

Fix: Debug code, check pointers and array subscripts in particular. For MPI programs check that you are not sending more data than there is in an array.

Signal 10

Cause: Signal 10 is very rare on UNIX/LINUX systems and indicates that there has been a 'bus error'. Note that the number is platform dependent: the bus error signal (SIGBUS) is signal 10 on Solaris, the BSDs and macOS, but signal 7 on x86 Linux, where signal 10 is SIGUSR1. Apparently a large proportion of these errors come from incorrect assembly instructions being written to the CPU. This can occur in poorly written software by accident - some compilers will emit badly grouped instructions. You may also need to check that you have used the correct 'bit' compiler - i.e. have you used the 64-bit compiler on a 32-bit platform?

Fix: Check your pointer and memory references - this error can occur if a reference/pointer is poorly assembled (using addition/multiplication). It is very rare and very difficult to find in code. Also check the compiler and Operating System you are using (uname -a).

Signal 13

Cause: Pipe failure - one process is trying to write to another process but there is no process to receive the data. This is quite unlikely but can happen with some MPICH programs where the runtime is listening for application output to route to the parent node.

Fix: Exit all MPD's an
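Because the signal numbers quoted in error messages are not the same on every platform, it can help to ask the machine itself what a number means before matching it against a list like the one above. The sketch below uses Python's standard signal module; describe_signal is a hypothetical helper name, and the output depends on the system it runs on.

```python
import signal


def describe_signal(number):
    """Return a readable name for a signal number on the current platform."""
    try:
        sig = signal.Signals(number)
    except ValueError:
        # Numbers without a symbolic name in Python's table end up here.
        return f"signal {number}: no symbolic name on this platform"
    return f"signal {number}: {sig.name} ({signal.strsignal(sig)})"


if __name__ == "__main__":
    # The numbers discussed on this page.
    for n in (7, 9, 10, 11, 13):
        print(describe_signal(n))
```

Running this on the node where the job failed shows which of the numbers above is the bus error there, instead of relying on a table written for a different system.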
Tue, 22 Nov 2011 01:51:14 +0000 (GMT) Reply-to: Lukas Razik