Problem with start up of the SLURM controller daemon

I’m trying to configure SLURM on an Ubuntu 23.10 system so that it uses MySQL via slurmdbd. This is a continuation of an earlier question which I solved through somewhat random guessing…

The funny thing is that the SLURM controller (slurmctld) fails to start upon boot. However, when I manually restart the service, it appears fine.

For example, if I type sudo service slurmctld status after booting, I see these messages:

Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: error: Sending PersistInit msg: Connection refused
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: No memory enforcing mechanism configured.
Feb 03 17:10:27 mycomputer slurmctld[1682]: WARNING: MYSQL_OPT_RECONNECT is deprecated and will be removed in a future version.
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: fatal: You haven't inited this storage yet.
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Failed with result 'exit-code'.

which is similar to the information in the /var/log/ log file. However, if I restart it with sudo service slurmctld restart, without changing any configuration files, it starts up with this in the log:

Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered information about 0 jobs
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered state of 0 reservations
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: read_slurm_conf: backup_controller not specified
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Running as primary controller
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: No parameter for mcs plugin, default values set
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: mcs: MCSParameters = (null). ondemand set.
Feb 03 23:23:02 mycomputer slurmctld[30777]: slurmctld: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,...

And it seems fine now.

My only guess is that it might have to do with the order in which slurmdbd, slurmd, and slurmctld services are started. But I have been assuming that the default order is correct. Perhaps this assumption is wrong?

Asked By: Ray

||

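The default slurmctld.service and slurmd.service units are missing ordering dependencies: slurmctld can be started before slurmdbd (and the MySQL server behind it) is up, which matches the "Connection refused" and mysql_real_connect errors in your boot log. Let's add two (thanks @Ray for the clarification).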

Create a file named /etc/systemd/system/slurmctld.service.d/99-mysql-ordering-askubuntu-1502374.conf:

[Unit]
# After= is additive in a drop-in: this appends slurmdbd.service to the unit's existing After= list
After=slurmdbd.service

Create a file named /etc/systemd/system/slurmd.service.d/99-mysql-ordering-askubuntu-1502374.conf:

[Unit]
# After= is additive in a drop-in: this appends slurmctld.service to the unit's existing After= list
After=slurmctld.service
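
If you would rather not create the directories and files by hand, the two drop-ins above could be generated with a small script. This is only a sketch: STAGE and write_dropin are hypothetical names of my own, not part of Slurm or systemd, and the script writes to a scratch directory so that nothing under /etc is touched until you copy the files yourself.

```shell
#!/bin/sh
# Sketch: stage the two drop-ins from this answer in a scratch directory.
# STAGE and write_dropin are illustrative names, not Slurm or systemd features.
set -eu

STAGE="${STAGE:-$(mktemp -d)}"
NAME="99-mysql-ordering-askubuntu-1502374.conf"

# write_dropin UNIT DEP: create UNIT.d/NAME containing "After=DEP".
write_dropin() {
    mkdir -p "$STAGE/$1.d"
    printf '[Unit]\nAfter=%s\n' "$2" > "$STAGE/$1.d/$NAME"
}

write_dropin slurmctld.service slurmdbd.service
write_dropin slurmd.service slurmctld.service

echo "Staged drop-ins under $STAGE"
# To install (needs root) and make systemd re-read its unit files:
#   sudo cp -r "$STAGE"/. /etc/systemd/system/
#   sudo systemctl daemon-reload
```

Staging first lets you inspect the generated files before they affect the boot sequence.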

Then reboot.
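
To confirm that systemd picked up the new ordering, you can read back each unit's After= list. The following is a rough sketch, assuming the unit names used above; has_after and check_order are hypothetical helper names, and the check is skipped when systemctl is not available.

```shell
#!/bin/sh
# Sketch: report whether each Slurm unit is ordered after the expected service.

# has_after LIST DEP: succeed if DEP is one of the space-separated units in LIST.
has_after() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_order UNIT DEP: read UNIT's After= list from systemd and look for DEP.
check_order() {
    deps=$(systemctl show -p After --value "$1" 2>/dev/null || true)
    if has_after "$deps" "$2"; then
        echo "$1 is ordered after $2"
    else
        echo "$1 is NOT ordered after $2"
    fi
}

if command -v systemctl >/dev/null 2>&1; then
    check_order slurmctld.service slurmdbd.service
    check_order slurmd.service slurmctld.service
fi
```

Running systemctl cat slurmctld.service should also list the drop-in file below the packaged unit, and systemd-analyze critical-chain slurmctld.service shows the order that was actually used on the last boot.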

Answered By: Daniel T