UPPMAX shutdown due to cooling failure -- FIXED

2017-09-26

The external cooling failed for as yet unknown reasons. All clusters and storage systems were shut down in order to prevent permanent hardware damage.

Please refrain from contacting support to ask for updates. We will update this article when new information becomes available.

Around 19:40 today the alarms about high temperatures in the computer room started to reach UPPMAX staff.

At 20:03 the temperature in the computer room reached critical levels and we were forced to shut down several systems including Irma, Milou and Rackham.

We still do not know what caused the supply of cooling to the computer room to fail, but we will of course investigate this.

We are sorry for the problems this might have caused you and your research but it was necessary to shut down the systems in order to prevent permanent damage to the hardware.

UPDATE WEDNESDAY AT 0815 HOURS

For some reason the main cooling circuit at Ångströmlab had stopped and the two main pumps were not running. They had gone into emergency shutdown due to low pressure in the system.

Bravida and Akademiska hus were at the site at approximately 19:30 and they finally got the pumps running again around 23:15.

This morning at 07:50 we began to restart our systems. This will most likely take the whole day and maybe more. We will continue to update this post about our progress.

UPDATE WEDNESDAY AT 1250 HOURS

Please note that any jobs that were still running yesterday evening, when we had to stop all systems, will need to be resubmitted. When you run "finishedjobinfo", they will probably be marked with jobstate=NODE_FAIL. Jobs that started after that might run into strange problems because of bad connections to storage systems. We are sorry about these problems.
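To see which of your jobs need to be resubmitted, you can filter the "finishedjobinfo" output for jobstate=NODE_FAIL. The following is only a minimal sketch: it assumes the output consists of lines with key=value fields, including jobid= and jobstate=. The sample lines, job IDs, user name and job names are made up for illustration and are not real output, and the function name failed_job_ids is hypothetical.

```python
import re


def failed_job_ids(report: str, state: str = "NODE_FAIL") -> list[str]:
    """Return the job IDs of all lines whose jobstate equals `state`.

    Assumes each line holds whitespace-separated key=value fields.
    """
    ids = []
    for line in report.splitlines():
        # Collect every key=value pair on the line into a dict.
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if fields.get("jobstate") == state and "jobid" in fields:
            ids.append(fields["jobid"])
    return ids


# Made-up sample in the assumed layout (not real finishedjobinfo output).
sample = """\
2017-09-26 20:03:11 jobid=1234567 jobstate=NODE_FAIL username=alice jobname=align
2017-09-26 18:12:40 jobid=1234568 jobstate=COMPLETED username=alice jobname=sort
2017-09-26 20:03:12 jobid=1234569 jobstate=NODE_FAIL username=alice jobname=call
"""

print(failed_job_ids(sample))
```

You could save the output of "finishedjobinfo" to a file and pass its contents to the function; the resulting job IDs are the ones to check and resubmit.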

Jobs that are still waiting in a Slurm queue will probably run without problems when we put the systems back in production.

The cooling medium (water) in the building complex (Ångströmlab), where UPPMAX's computer room is located, is leaking somewhere, but no one knows where yet.

UPPMAX probably cannot put the systems back in production until that problem is solved, because future repair work might (again) leave our computer room without cooling. (And any jobs that we allowed to run at that time would crash.)

We have decided to use the waiting time to do now what we had planned for the maintenance on Wednesday next week.

So we are going to upgrade Bianca, Fysast1, Grus, Milou, and Rackham. In return, we no longer plan any maintenance for next week.

We also plan to upgrade Irma, today or tomorrow. That will be a little more difficult, due to current problems with the storage system Lupus.

UPDATE WEDNESDAY AT 1640 HOURS

We have upgraded Fysast1, Milou and Rackham, and now allow you to log in, if you have a project there.

Upgrade of Bianca and Irma will continue tomorrow.

The cooling problem is not solved. Someone will continue to add new water to the system, day and night, but the leak has not been found.

UPPMAX anticipates that the future repair will create too much heat in the computer room. We do not want running jobs to crash when we (again) need to stop the compute nodes, so we will not unlock the Slurm queues yet.

Hopefully this will be solved tomorrow, Thursday.

UPDATE THURSDAY AT 1150 HOURS

Akademiska hus (AU), which is responsible for the cooling system of the Ångström laboratory (where our server hall is located), reports that the leak has still not been found. AU is refilling the coolant regularly, and will continue to do so until the leak is found.

We have started the queues on Milou, Rackham and Fysast1 again, but we may be forced to stop the queues and shut down the hall once again, depending on the outcome of the ongoing investigation by AU.

UPDATE MONDAY AT 1615 HOURS

The leak has not yet been found, but the system is no longer leaking. It looks like it has repaired itself.

Today we have also started Irma and Bianca, and thus everything is back in production.

Happy computing!
