Login issue for new Bianca projects -- FIXED
A network problem has been detected on Bianca, causing logins to fail for a few of the most recent Bianca projects. We are working on fixing the problem and expect to have Bianca fully working again very soon.
Maintenance window -- COMPLETED
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Dis, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm queues on Fysast1, Irma, Milou and Rackham will be stopped, but access to Slurm commands will mostly work during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
UPDATE 2018-01-10, 16:00
Irma is up and running. Bianca, Milou, Rackham and Fysast1 are still down. We will continue security upgrades tomorrow (Thursday) morning.
UPDATE 2018-01-11, 15:15
Irma, Milou, Rackham and Fysast1 are up and running. Bianca is still being tested. Hopefully Bianca will be back today.
UPDATE 2018-01-11, 16:00
Bianca is now up; however, graphical login is not working right now. Text login works fine (http://www.uppmax.uu.se/support/user-guides/bianca-user-guide/).
We're still working on Dis and expect it to be up by tomorrow.
Extension of lupus
The vendor visited us last week and did the physical installation of the lupus extension. Unfortunately, some parts were incorrect, and we are currently waiting for replacements that are expected to arrive this week.
UPPMAX staff back after the holidays
We hope 2018 has been good to you so far! UPPMAX staff is back after the holidays and we're focusing on support tickets that have built up over the holidays.
Reduced staff availability over the coming holidays combined with lots of tickets
Most of our staff is on vacation over the coming holidays. You can contact us using regular channels, but response times for support questions might be longer than normal. We are sorry for the inconvenience.
Most of us will be back in the first week of January.
We also have a lot of tickets about transfers from Milou to Rackham/Bianca, and we think there might be hundreds of last-minute requests in January. Be prepared that the process of getting a transfer project takes some time.
If you want to continue your Milou project, make sure you have applied for a storage project and compute project on Rackham (for non-sensitive data), or a project on Bianca (for sensitive data). http://uppmax.uu.se/support/getting-started/moving-your-research-from-milou-to-rackham/
Creation of new Bianca projects currently on hold -- FIXED
The creation of new Bianca projects is currently on hold. If your project is scheduled to start today, you will be unable to log in.
milou2 rebooted on Friday 2017-12-08 at 03:52
milou2 rebooted on Wednesday 2017-12-06 at 03:58
Updates from SUPR are temporarily disabled
We are performing a change in our infrastructure today, starting at 13:00. This change will temporarily stop updates from SUPR from reaching UPPMAX. If you have, for example, recently joined a project or added a member to one, you will have to wait before the change becomes visible at UPPMAX.
Fix for broken SSH-connections to the UPPMAX Cloud
If you regularly end up with broken SSH-connections ("broken pipe") to your virtual machine in the UPPMAX region, please use the SSH option ServerAliveInterval. See below for an example.
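A minimal sketch of such a configuration, assuming you connect with OpenSSH. The host alias and address are placeholders (not UPPMAX values), and the 60-second interval is only an illustrative choice. It would go in ~/.ssh/config on the machine you connect from:

```
# ~/.ssh/config
# "my-cloud-vm" and 203.0.113.10 are placeholders; use your own VM's alias and address.
Host my-cloud-vm
    HostName 203.0.113.10
    # Send a keep-alive message after 60 seconds without data from the server
    ServerAliveInterval 60
    # Disconnect after 3 unanswered keep-alive messages (the OpenSSH default)
    ServerAliveCountMax 3
```

The same option can also be given for a single session on the command line, for example: ssh -o ServerAliveInterval=60 my-cloud-vm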
Issue with the Intel License server
At this moment there is an issue with the Intel license server. You will be unable to use the icc compiler and Intel tools until this issue is resolved.
UPPMAX support low on staff Monday 20/11
UPPMAX support will be low on staff on Monday 2017-11-20 due to a conference.
How to get a high job priority on Bianca
Support ticket system temporarily down -- FIXED
Our support email address email@example.com was down for a couple of hours, but is back in service again.
Logging in to Bianca without Rackham
Bianca users outside of SUNET will be unable to log in using rackham.uppmax.uu.se. We have created a temporary workaround.
Rackham unavailable -- SOLVED: Rackham available
2017-11-17 09:35 Rackham is now back in regular service.
Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.
It was decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves broken. The problems now seem to be solved, and we're awaiting results from the last tests before Rackham is fully back in service.
UPPMAX power outage -- FIXED
UPPMAX experienced a power outage in the server hall on Tuesday.
Problems with /sw on Bianca (now fixed)
The /sw part of Bianca was lost around 07:30 this morning due to an issue with the storage system. This may have caused jobs to fail. The system was fixed at 08:40.
Quick upgrade of Slurm 2017-11-02 -- COMPLETED
Maintenance window Wednesday 2017-11-01 -- COMPLETED
Monthly maintenance window begins at 09:00 hours on the first Wednesday of the month. (That is today.)
Issues with /sw/data during the weekend
/sw/data from pica may have been unavailable for some jobs during the weekend, and some jobs may have failed because of this.
UPPMAX support system is down -- SOLVED
RT, the support system used by UPPMAX and the rest of SNIC, is down.
It is located at NSC at Linköping University and the whole university has network problems.
All email to and from firstname.lastname@example.org will be delayed until the network problem is fixed, so answers to your support tickets will be delayed.
We now have contact with our support system and emails to email@example.com are reaching us again.
Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.
We found the guilty jobs and are terminating them, and we have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We are sorry to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were canceled and will show up with status NODE_FAIL. The accident occurred while we were investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that cannot start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power because an automatic fuse shut down.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an ethernet switch died, possibly from a short circuit, so the automatic fuse shut down, taking more switches and the compute nodes down with it.
We have reported the error to our support vendor. Until the bad ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE
We're restarting irma-q for technical reasons. The Slurm queue system may be unavailable for submitting jobs and checking job status for a few minutes.
milou2 rebooted August 19
milou2 rebooted on Saturday 2017-08-19.
Bianca's storage system Castor had a hiccup yesterday Thursday -- FIXED
Maintenance window Wednesday 2017-08-02 -- FINISHED
Unexpected reboot of Pica on Monday morning
Restart of two Milou login servers today Thursday
Lower service level during UPPMAX holidays
Part of storage system Pica is still very slow
Pica was partly restarted just now, please look for problems in your job output
UPPMAX had to restart part of storage system Pica, because it worked too slowly with nearly no read/write traffic.
The restart was done a little after 1300 hours.
For Rackham users, this meant that you might have had problems with reading and writing to your home directory.
For Milou users, this likewise meant that you might have had problems with reading and writing to your home directory. In addition, reading from /sw (where the modules live) and reading and writing to some project directories were affected.
Please check the output of any jobs that were running at this time an extra time for problems.
We are sorry for the inconvenience.
On Milou and Rackham, very difficult to login or otherwise use /home directories -- FIXED
UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on Milou, and to home directories on Rackham.
Because of that, it is very difficult to log in to Milou and Rackham.
We will investigate the source of this problem, and will report any success as updates here.
Update at 1310 hours
We restarted part of Pica, and that solved the problem.
Hopefully your jobs will continue without problems, but please be careful and check your job output an extra time for errors.
SUPR and C3SE website down
The SUPR and C3SE websites are down at the moment, which prevents you from using SUPR. Please try again later.
No maintenance planned for today's maintenance window
The first (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.
But today we will do no maintenance.
The next maintenance window is on the 2nd of August.
Restart of login server milou-f Tuesday morning -- FINISHED
File system mounts of Pica volumes were not working correctly.
This was fixed by a restart of the server. Now it works much better.
We are sorry about any inconvenience for you due to this.
Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED
From approximately 0800 hours to 0910 hours this morning, an ethernet switch in Milou lost power, making 48 nodes unavailable.
Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.
Singularity is available
Urgent kernel upgrade -- FINISHED
Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues will be stopped. We will post updates on the progress here in System News during the day.
UPDATE 16:00 - Update completed.
Intelmpi performance issues
Bianca graphical login now working
It uses ThinLinc Web Access, not X forwarding.
Maintenance window Wednesday 2017-06-07 -- FINISHED
Maintenance starts at 0900 hours and will probably last all day long. This time, we will:
Upgrade kernel and other system software on all nodes of Bianca, Fysast1, Grus, Irma, Milou, and Rackham.
Upgrade firmware on two controllers of Crex.
Replace Battery Backup Unit of Lupus.
Reconfigure internal networks of Bianca.
We will restart all login nodes during the day. Otherwise Fysast1, Irma, Milou, and Rackham will be available, probably only with small disturbances with the Slurm connection. Jobs on these four UPPMAX clusters will continue to run during maintenance.
Bianca will probably be unavailable all day, with stopped Slurm queues.
Storage system Grus will be unavailable during the upgrade.
This news text will be updated during the maintenance, so here you will be able to see when the maintenance has finished.
Update at 1120 hours
Firmware on storage system Crex has been upgraded.
Storage system Castor has been upgraded.
Login nodes have been upgraded. We are now restarting them.
Update at 1300 hours
Login nodes have been restarted and checked.
Battery Backup Unit of Lupus has been replaced.
Storage system Grus has been upgraded.
Upgrade of Bianca proceeds slowly but without any known problems, so we still think that we will finish today.
Update at 1600 hours
Only Bianca maintenance is left to do. We have upgraded everything. Now we are reconfiguring internal networks. We will also make several function tests before allowing new logins. We still plan to finish today.
Update at 1720 hours
We have now also finished the maintenance of Bianca, and will close the maintenance window. Please get in touch if something has stopped working.
Next maintenance window is Wednesday July 5th.