Problem with Slurm on Rackham and Milou


There is currently a problem with the Slurm master node that affects users on Rackham and Milou. We are investigating.

The problem results in various timeouts, for example:

"squeue: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to rackham-q:7031: Unable to connect to database"

This also affects tools such as projinfo and jobinfo.

Update 15:50

The problem is related to the storage driver, which locks the CPU long enough for the watchdog to start complaining. We will continue to investigate, but the problem should be less severe now.
