Rackham unavailable -- SOLVED: Rackham available

2017-11-09

2017-11-17 09:35 Rackham is now back in regular service.

Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.

It was decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves broken. The problems now seem to be solved, and we're awaiting results from the last tests before Rackham is fully back in service.
 

Thursday 2017-11-09

This morning, Rackham's storage system Crex reported multiple disks as broken, leaving no redundancy remaining. We believe this was very likely caused by the power outage on Tuesday this week. To prevent actual data loss, we have decided to turn the storage system off while waiting for replacement disks to arrive. This means Rackham will be turned off too.

This is very far from the normal procedure. At UPPMAX we find broken disks daily and it is usually not an issue for anyone. When a large number of disks fail at the same time (very rare under normal conditions), we risk losing the redundancy and need to step in.

We expect to receive new disks within a day.

Update Friday 2017-11-10 12:00

Spare parts have been sent but will not arrive until Monday. Rackham will be offline over the weekend and for at least part of Monday.

Update Monday 2017-11-13 15:50

The spare parts have been received and installed. The problem unfortunately persists. We hope to receive additional spare parts tomorrow. Course administrators with course projects beginning this week, please contact support@uppmax.uu.se so we can let your project run on Milou instead.

Update Tuesday 2017-11-14 17:00

Together with the vendor, we have begun ruling out issues with the IOMs (Input Output Modules) that we initially replaced and believed were causing the problems. We are now performing software maintenance to rule out problems with corrupt firmware and/or software. We hope to know the outcome by tomorrow.

Update Thursday 2017-11-16

Login nodes are now open on Rackham, and jobs are expected to run as usual tomorrow morning.

We have changed two controllers and upgraded all software on Crex, including IO modules. 8 of 15 disks previously reported broken have successfully been rebuilt. When all 15 disks are rebuilt without error, we'll let your jobs run as usual. We expect all tests to be finished without any errors by tomorrow morning. 

Nothing indicates any loss of user data; protecting it has been our top priority. However, jobs that were running when Rackham shut down for the power outage, and for this incident the day after, are affected. Check result files and Slurm output files to see which jobs you need to resubmit.
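As a rough aid for that check, the sketch below lists Slurm output files that were last written during a given time window. It is only a starting point under stated assumptions: the function name `outputs_touched_between` is hypothetical, the pattern `slurm-*.out` assumes you kept Slurm's default output file naming, and a file written during the window only marks a candidate to inspect, since the job may also have finished normally.

```python
from datetime import datetime
from pathlib import Path


def outputs_touched_between(root, start, end, pattern="slurm-*.out"):
    """Return files under root matching pattern whose last modification
    time lies between start and end (inclusive), sorted by path.

    Hypothetical helper: the default pattern assumes Slurm's default
    output file naming; pass your own pattern if you used --output.
    """
    hits = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        # Compare in local time, matching naive datetime arguments.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if start <= modified <= end:
            hits.append(path)
    return sorted(hits)
```

For example, `outputs_touched_between(".", datetime(2017, 11, 7), datetime(2017, 11, 10))` would list candidates under the current directory; the window here is an assumption based on the dates in this notice and should be adjusted to when your own jobs were running.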

Until the last disks have been recovered completely and have successfully passed our tests, we're only opening login nodes. There will be no access to compute nodes until tomorrow morning, so Slurm queues are stopped. However, it is possible to read and write files on Crex and to submit new jobs to the Slurm queue, so that they start as soon as possible.

UPPMAX
