Cannot Connect To Server Errno=15096 Error Getting Connection To Socket
(errno=15096) socket_connect error (VERIFY THAT trqauthd IS RUNNING)

I'm having a problem using qsub to launch a Perl script in which various other scripts are embedded. I know that the scripts themselves are fine, because they execute with no error when run from the command line. However, if I try running them using qsub, I get an error. I've tried a million variants of these, including wrapping the Perl script in a shell script and executing the shell script via qsub, but nothing works :-(

The architecture is as follows:

less test.sh:

    #!/bin/bash
    #./etc/sysconfig/pssc
    JOB_NAME="QSH_$(whoami)"
    NODE_NUM="1"
    NODE_PPN="${NODE_NCPUS}"
    HOURS="24"
    MINUTES="00"
    SECONDS="00"
    WALLTIME=${HOURS}:${MINUTES}:${SECONDS}
    #RES_LIST="nodes=${NODE_NUM}:ppn=${NODE_PPN}:walltime=${WALLTIME}"
    RES_LIST="nodes=${NODE_NUM}:ppn=${NODE_PPN}"
    DIR_WORK="${PBS_O_WORKDIR}"
    QUEUE="high"
    cd ${DIR_WORK}
    echo "`perl run.pl parameterfile1.txt`"

I run the script as follows:

    qsub test.sh

I do get some of the initial output correct and I can see it in the output file (test.sh.o), but it always crashes with the following error message (in test.sh.e):

less test.sh.e
http://stackoverflow.com/questions/27895454/errno-15096-socket-connect-error-verify-that-trqauthd-is-running

http://torqueusers.supercluster.narkive.com/NVmo8fQp/socket-issues-in-torque-4-1-x

... be running out of available sockets while queuing/scheduling jobs. We routinely queue tens of thousands of jobs at a time (up to around 30-40k total) and after several hundred or a thousand I start seeing these errors in the logs and random jobs get dropped (not queued). I've tried limiting the rate at which I add jobs, adjusting the number of open files (ulimit -n 32788), adjusting TCP_WAIT timeout from 60 to 5 seconds (/proc/sys/net/ipv4/tcp_fin_timeout), etc. This is essentially a brand-new system with a default installation of Torque 4.1.1.

    09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 29 - num_connections=74 (select bad socket)
    09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 12 - num_connections=68 (select bad socket)
    09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 12 - num_connections=59 (select bad socket)
    09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 29 - num_connections=54 (select bad socket)
    09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 32 - num_connections=52 (select bad socket)
    09/08/2012 12:16:58;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 78 - num_connections=27 (select bad socket)
    09/08/2012 12:16:59;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 11 - num_connections=17 (select bad socket)
    09/08/2012 00:13:32;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests
    09/08/2012 00:14:43;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wai
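When log entries like the ones above pile up, it can help to summarize them before changing limits. The following is only a rough sketch: the function names count_bad_socket and min_connections are made up for illustration, and the sample text is two of the log lines quoted above. It also prints the two settings the poster mentions adjusting, as seen by the current shell.

```shell
#!/bin/bash
# Hypothetical helper: count "select bad socket" entries in pbs_server
# log text read from stdin. grep -c prints 0 and exits non-zero when
# nothing matches, so the exit status is neutralized with "|| true".
count_bad_socket() {
    grep -c 'select bad socket' || true
}

# Hypothetical helper: print the smallest num_connections value found
# in the log text on stdin (prints nothing if the field never appears).
min_connections() {
    grep -o 'num_connections=[0-9]*' | cut -d= -f2 | sort -n | head -n 1
}

# Sample input: two of the log lines quoted in the post above.
sample_log='09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 29 - num_connections=74 (select bad socket)
09/08/2012 12:16:59;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request,closed connections to fd 11 - num_connections=17 (select bad socket)'

echo "bad socket entries: $(printf '%s\n' "$sample_log" | count_bad_socket)"
echo "lowest num_connections: $(printf '%s\n' "$sample_log" | min_connections)"

# Settings mentioned in the post, as seen by the current shell.
echo "open file limit (ulimit -n): $(ulimit -n)"
if [ -r /proc/sys/net/ipv4/tcp_fin_timeout ]; then
    echo "tcp_fin_timeout: $(cat /proc/sys/net/ipv4/tcp_fin_timeout)"
fi
```

To run the helpers against a real log, redirect the server log file into them, for example count_bad_socket < your_server_log, where the file location depends on the local Torque installation.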
http://linuxtoolkit.blogspot.com/2015/03/unable-to-submit-via-torque-submission.html

http://www.clusterresources.com/pipermail/torqueusers/2015-September/018316.html

... Server version 4.2.7. I was trying to configure a Submission Node. Here is a sample of my qmgr -c "p s" output. The firewall allows the necessary traffic in outr

    # qmgr -c "p s"
    ..........
    set server acl_hosts = submission_node.cluster.spms.ntu.edu.sg
    set server acl_hosts += head_node.cluster.spms.ntu.edu.sg
    set server submit_hosts = submission_node.cluster.spms.ntu.edu.sg
    set server submit_hosts += head_node.cluster.spms.ntu.edu.sg
    set server allow_node_submit = True
    .......

After I ssh into the submission_node and submit as a test user, I get these errors. Yes, the submission_node has been configured as a conventional client.

    socket_connect error (VERIFY THAT trqauthd IS RUNNING)
    Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]
    socket_connect error (VERIFY THAT trqauthd IS RUNNING)
    Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]
    socket_connect error (VERIFY THAT trqauthd IS RUNNING)
    Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]
    Unable to communicate with head_node(10.10.10.20)
    Communication failure.
    qsub: cannot connect to server head_node (errno=15137) could not connect to trqauthd

According to the Torque 4.2.7 documentation, you have to make sure the submission node has the trqauthd script at /etc/init.d if you are using RH / CentOS.
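The trailing 111 in the message above matches the Linux errno for "connection refused", which would be consistent with nothing listening on the unix socket. A minimal sketch for checking this on the submission node follows. The helper names check_trqauthd_socket and check_trqauthd_process are invented for illustration; the socket path is the one quoted in the error message, and pgrep is assumed to be installed.

```shell
#!/bin/bash
# Hypothetical helper: report whether a unix socket exists at the given
# path. The default path is taken from the error message quoted above.
check_trqauthd_socket() {
    local sock="${1:-/tmp/trqauthd-unix}"
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
        return 0
    fi
    echo "socket missing: $sock"
    return 1
}

# Hypothetical helper: report whether a process with exactly this name
# is running (assumes pgrep is available on the node).
check_trqauthd_process() {
    local name="${1:-trqauthd}"
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "process running: $name"
        return 0
    fi
    echo "process not running: $name"
    return 1
}

# Run both checks; "|| true" keeps the script going when a check fails.
check_trqauthd_socket || true
check_trqauthd_process || true
```

If both checks report a problem on the node where qsub is run, the daemon has probably not been started there, which is what the error text itself asks you to verify.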
You can simply scp /etc/init.d/trqauthd to the submission node.

From the head_node:

    # scp -v /etc/init.d/trqauthd root@submission_node:/etc/init.d/

Create the /etc/hosts.equiv file:

    # touch /etc/hosts.equiv

Put the Submission_Node host name in the /etc/hosts.equiv of the head_node:

    submission_node

At the Submission_Node, start the trqauthd service:

    # service trqauthd start

Now try submitting as a normal user.

Posted by kittycool at 4:57 PM
Labels: Torque
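The hosts.equiv step in the post can be repeated safely if the entry is only appended when it is missing. This is a sketch under assumptions: add_host_entry is a hypothetical helper name, and the demonstration writes to a temporary file instead of the real /etc/hosts.equiv.

```shell
#!/bin/bash
# Hypothetical helper: append a host name to a hosts.equiv-style file
# only if that exact line is not already present.
add_host_entry() {
    local file="$1" host="$2"
    touch "$file" || return 1
    if grep -qxF -- "$host" "$file"; then
        echo "already present: $host"
    else
        printf '%s\n' "$host" >> "$file"
        echo "added: $host"
    fi
}

# Demonstration against a temporary file, not the real /etc/hosts.equiv.
demo_file="$(mktemp)"
add_host_entry "$demo_file" submission_node
add_host_entry "$demo_file" submission_node
cat "$demo_file"
rm -f "$demo_file"
```

On the head node the same call would be made as root with /etc/hosts.equiv as the first argument and the real submission node host name as the second.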
2015-09-28 23:17 GMT+08:00 Andrus, Brian Contractor