LSF Error Codes
Exit code number, meaning, example, and comments:

1
  Meaning: Catchall for general errors
  Example: let "var1 = 1/0"
  Comments: Miscellaneous errors, such as "divide by zero" and other impermissible operations

2
  Meaning: Misuse of shell builtins (according to Bash documentation)
  Example: empty_function() {}
  Comments: Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison)

126
  Meaning: Command invoked cannot execute
  Example: /dev/null
  Comments: Permission problem or command is not an executable

127
  Meaning: "command not found"
  Example: illegal_command
  Comments: Possible problem with $PATH or a typo

128
  Meaning: Invalid argument to exit
  Example: exit 3.14159
  Comments: exit takes only integer args in the range 0 - 255 (see first footnote)

128+n
  Meaning: Fatal error signal "n"
  Example: kill -9 $PPID of script
  Comments: $? returns 137 (128 + 9)

130
  Meaning: Script terminated by Control-C
  Example: Ctl-C
  Comments: Control-C is fatal error signal 2 (130 = 128 + 2, see above)

255*
  Meaning: Exit status out of range
  Example: exit -1
  Comments: exit takes only integer args in the range 0 - 255

Source: http://tldp.org/LDP/abs/html/exitcodes.html
See also (LSF job exit codes): http://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_admin/job_exit_codes_lsf.html
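The 128+n rule in the table can be checked mechanically. The following is a minimal sketch, not part of the original guide: the function name `describe_exit_code` is hypothetical, the entry for 0 is added for completeness, and the signal names come from Python's `signal` module, so they reflect whatever host the code runs on.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell exit status (0-255) to the meanings listed in the table."""
    if not 0 <= code <= 255:
        raise ValueError("exit status must be in the range 0-255")

    # Fixed meanings taken from the table; 0 (success) is an addition.
    fixed = {
        0: "success",
        1: "catchall for general errors",
        2: "misuse of shell builtins",
        126: "command invoked cannot execute",
        127: "command not found",
        128: "invalid argument to exit",
        255: "exit status out of range",
    }
    if code in fixed:
        return fixed[code]

    if code > 128:
        # 128+n: the process was ended by fatal signal n.
        n = code - 128
        try:
            name = signal.Signals(n).name
        except ValueError:
            # Signal numbers are OS dependent; n may not exist on this host.
            name = "unknown signal"
        return f"fatal error signal {n} ({name})"

    # Anything else below 126 is defined by the application itself.
    return "application-defined exit code"


if __name__ == "__main__":
    for code in (1, 127, 130, 137):
        print(code, "->", describe_exit_code(code))
```

For 137 the function reports signal 9 (137 - 128), and for 130 it reports signal 2, matching the two worked cases in the table.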
Why did my job exit?

LSF collects job information and reports the final status of a job. Traditionally, a job that finishes normally reports a status of 0. Any non-zero status means that the job has exited abnormally. Most of the time, an abnormal job exit is caused by the job itself or by the system it ran on, not by an LSF error. This document explains some of the information LSF provides about abnormal job termination.

(Source: http://sunray2.mit.edu/kits/platform-lsf/7.0.6/1/guides/kit_lsf_guide_source/lsf_config_ref/lsf_exit_code.html; related discussion: https://github.com/PlatformLSF/platform-python-lsf-api/issues/3)

How LSF translates events into exit codes

The following table summarizes LSF exit behavior for some common error conditions.

Command not found
  LSF exit code: 127
  Operating system: all
  System exit code equivalent: 1 or 127
  Meaning: Command shell returns 1 if command not found. If the command cannot be found inside a job script, LSF returns exit code 127.

Directory not available for output
  LSF exit code: 0
  Operating system: all
  System exit code equivalent: 1
  Meaning: LSF sends the output back to the user through email if the directory is not available for output (bsub -o).

LSF internal error
  LSF exit code: -127, 127
  Operating system: all
  System exit code equivalent: N/A
  Meaning: RES returns -127 or 127 for all internal problems.

Out of memory
  LSF exit code: N/A
  Operating system: all
  System exit code equivalent: N/A
  Meaning: Exit code depends on the error handling of the application itself.

LSF job states
  LSF exit code: 0
  Operating system: all
  System exit code equivalent: N/A
  Meaning: Exit code 0 is returned for all job states.

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. For jobs to be automatically rerun from the beginning or restarted from a checkpoint on another host after a host failure, they must be submitted with specific options at initial job submission.

If a job is submitted with bsub -r or to a queue with RERUNNABLE set, it reruns automatically on host failure.
If a job is submitted with bsub -k or to a checkpointable queue or application profile, it can be restarted if the host fails and the checkpoint succeeds.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.

Exited jobs

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes in
PlatformLSF/platform-python-lsf-api, issue #3: jobFinish exitStatus has exitcodes not in lsbatch.h / lsf.h
Closed. PeteClapham opened this issue Sep 29, 2014 · 7 comments

PeteClapham commented Sep 29, 2014

Hello

Is there a way to add exit detail in a consistent fashion to the output of jobFinishLog.exitStatus? We are seeing a number of interesting numbers, including: 6400, 8704, 33280, 34304, 35840, 256, 512, 65280 etc etc.

Some appear to be a bit extension of the translated bhist values BUT this seems very inconsistent and there doesn't appear to be a hook for the translated exit cause as viewed within bacct -l.

eg
31232 -> 122 What is 122 ?!
33280 -> 130 For what reason ? Mem or CPU limit reached ?

Is it possible to either make the exit codes consistent with bhist / bacct and / or provide a variable hook for the exit hint (i.e. job exceeded memory limit) as well as a consistent exit code ?

Many thanks and apologies if I'm missing something obvious.

Pete

hanhiver commented Oct 1, 2014

LSF job exit codes

Exit codes are generated by LSF when jobs end due to signals received instead of exiting normally. LSF collects exit codes via the wait3() system call on UNIX platforms. The LSF exit code is a result of the system exit values. Exit codes less than 128 relate to application exit values, while exit codes greater than 128 relate to system signal exit values (LSF adds 128 to system values). Use bhist to see the exit code for your job.

How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.

Note: Termination signals are operating system dependent, so signal 5 may not be SIGTRAP
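The two pairs quoted in the issue (31232 -> 122 and 33280 -> 130) are consistent with exitStatus holding a raw wait3() status in the traditional UNIX encoding, where the exit code sits in the high byte. The sketch below works under that assumption, which the thread does not confirm as LSF's documented behaviour. The function name `decode_wait_status` and the keys of the returned dict are hypothetical.

```python
def decode_wait_status(status: int) -> dict:
    """Split a 16-bit UNIX wait status into its parts (traditional encoding).

    Stopped/continued statuses are not handled in this sketch; a finished
    job should not report them.
    """
    low = status & 0x7F          # terminating signal, 0 if the process exited
    core = bool(status & 0x80)   # core-dump flag
    high = (status >> 8) & 0xFF  # value passed to exit()

    if low == 0:
        result = {"exited": True, "exit_code": high}
        if high > 128:
            # Shell convention: 128+n means fatal signal n.
            result["signal_hint"] = high - 128
        return result

    # The process was killed directly by a signal.
    return {"exited": False, "signal": low, "core_dump": core}


if __name__ == "__main__":
    # Values reported in the issue.
    for raw in (6400, 8704, 31232, 33280, 34304, 35840, 256, 512, 65280):
        print(raw, "->", decode_wait_status(raw))
```

Under this reading, 33280 decodes to exit code 130, which the 128+n convention maps to signal 2, and 31232 decodes to 122, which is below 128 and so would be an application exit value. What a given code means for a particular job still depends on the application and the execution host.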