Problem with Slurm on Milou -- fixed
There is currently a problem with Slurm on Milou. We are investigating it. You will likely not be able to submit new jobs, and you may encounter strange error messages when running 'sbatch' and 'interactive'.
An example of the error when running sbatch:
"sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)"
When running jobinfo:
Clustername not found
Update at 1055 hours
The problem is still not solved. For a while we will run with a much simpler priority handling on Milou, which at least allows job submission.
Update at 1230 hours
Slurm functions normally again, but we wait a while before declaring that it is really fixed.
Update at 1250 hours
The problem has disappeared, without any explanation. Let us hope that it does not return, and declare that the problem is fixed.
Issue with 'interactive' and creating slurm.out files
Problem with Slurm on Rackham and Milou
There is currently a problem with the Slurm master node which affects users on Rackham and Milou. We are investigating.
March maintenance day -- UPDATED Thursday 07:00
On Wednesday 7th of March UPPMAX began its monthly service window. Systems and services may become unreachable during the day.
Files and directories may be hidden on Bianca -- SOLVED Wednesday
We have received reports of missing files and directories inside the /proj and /proj/nobackup directories on Bianca. Upon inspection the files are actually there, but are not shown by the "ls" command. If you are working on Bianca, be aware of this: jobs of the type “process all files in directory X and compile the result” might finish without errors but produce incorrect results because of missing input, which could lead to wrong conclusions.
A workaround was implemented on Wednesday 2018-03-08 that mitigates this issue.
Configuration problems on Milou and Irma -- SOLVED
Slow home directories
Home directories have occasionally been extremely slow today. Nothing seems broken but the system is under a lot of pressure from time to time.
Rackham login issues -- SOLVED
We are currently seeing, and receiving reports of, login issues on Rackham.
The fat (256GB) Rackham nodes are currently unavailable -- SOLVED
The fat (256GB) Rackham nodes are currently unavailable due to an issue with Slurm. We are investigating this issue.
Rackham's storage system -- MONDAY: Queues released
Due to an issue with the storage system Crex, the Slurm queue on Rackham is currently on hold. This is a summary of the problem.
No new jobs on Rackham 2018-02-09 11:15
We are experiencing problems with Crex, the file system on Rackham. To avoid putting more strain on the file system, we will not allow new jobs to start at the moment. If you submit jobs, they will be held in the queue.
The fysast1 cluster is back online
The Milou cluster is back online
The Rackham cluster is back online
Bianca online again
The Bianca cluster is back online following our service window.
Maintenance window Wednesday 2018-02-07 -- CLOSED
For the February service we will install our new UPS, update Slurm on all clusters, extend the capacity of the storage system Lupus (for Irma), and of course perform the standard kernel and security updates.
The UPPMAX Cloud region will be unavailable Thursday 17:00-20:00 CET
A central switch will be restarted tomorrow, Thursday 2018-01-31. The cloud will be temporarily unreachable from the outside, i.e. from the Internet.
Problems with the 'interactive' and Slurm commands on Rackham
The Slurm master on Rackham is currently overloaded and you may experience sluggish Slurm behavior or timeout issues when running commands such as interactive, jobinfo and squeue. We are investigating this issue.
Some project volumes on pica are slow
Some project volumes on pica are slow. This may also affect home directories.
Login issue for new Bianca projects -- FIXED
A network problem has been detected on Bianca, causing logins to fail for a few of the most recent Bianca projects. We are working on fixing the problem, and expect to have Bianca fully working again very soon.
Maintenance window -- COMPLETED
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Dis, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm queues on Fysast1, Irma, Milou and Rackham will be stopped, but access to Slurm commands will mostly work during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
UPDATE 2018-01-10, 16:00
Irma is up and running. Bianca, Milou, Rackham and Fysast1 are still down. We will continue security upgrades tomorrow (Thursday) morning.
UPDATE 2018-01-11, 15:15
Irma, Milou, Rackham and Fysast1 are up and running. Bianca is still being tested. Hopefully Bianca will be back today.
UPDATE 2018-01-11, 16:00
Bianca is now up; however, graphical login is not working right now. Text login works fine (http://www.uppmax.uu.se/support/user-guides/bianca-user-guide/).
We're still working on Dis and expect it to be up by tomorrow.
Extension of lupus
The vendor visited us last week and did the physical installation of the lupus extension. Unfortunately, some parts were not correct and we're currently waiting for exchanges that are expected to arrive this week.
UPPMAX staff back after the holidays
We hope 2018 has been good to you so far! UPPMAX staff is back after the holidays and we're focusing on support tickets that have built up over the holidays.
Reduced staff availability over the coming holidays combined with lots of tickets
Most of our staff is on vacation over the coming holidays. You can contact us using regular channels, but response times for support questions might be longer than normal. We are sorry for the inconvenience.
Most of us will be back again in the first week of January.
We also have a lot of tickets about transfers from Milou to Rackham/Bianca, and we think there might be hundreds of last-minute requests in January. Be prepared that the process of getting a transfer project takes some time.
If you want to continue your Milou project, make sure you have applied for a storage project and compute project on Rackham (for non-sensitive data), or a project on Bianca (for sensitive data). http://uppmax.uu.se/support/getting-started/moving-your-research-from-milou-to-rackham/
Creation of new Bianca projects currently on hold -- FIXED
The creation of new Bianca projects is currently on hold. If your project is scheduled to start today, you will be unable to log in.
milou2 rebooted on Friday 2017-12-08 at 03:52
milou2 rebooted on Wednesday 2017-12-06 at 03:58
Updates from SUPR are temporarily disabled
We are performing a change in our infrastructure today starting at 13:00. This change will temporarily stop updates from SUPR reaching UPPMAX. If you have for example recently joined or added a member to a project, you will have to wait before the change becomes visible at UPPMAX.
Fix for broken SSH-connections to the UPPMAX Cloud
If you regularly end up with broken SSH-connections ("broken pipe") to your virtual machine in the UPPMAX region, please use the SSH option ServerAliveInterval. See below for an example.
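A minimal sketch of one way to set this option, assuming an OpenSSH client. The host alias 'my-uppmax-vm', the address 203.0.113.10, the file name 'ssh_keepalive.conf' and the 60-second interval are placeholders chosen for illustration, not values provided by UPPMAX:

```shell
# Write a small per-host SSH client configuration to a separate file.
# All names and values inside the block are illustrative placeholders.
# ServerAliveInterval 60 asks the client to send a keep-alive message
# after 60 seconds without any data from the server.
cat > ssh_keepalive.conf <<'EOF'
Host my-uppmax-vm
    HostName 203.0.113.10
    ServerAliveInterval 60
EOF

# Connect using that file (commented out because it needs a reachable VM):
# ssh -F ssh_keepalive.conf my-uppmax-vm
```

The same three lines can instead be added to your own ~/.ssh/config, or the option can be given for a single session with 'ssh -o ServerAliveInterval=60' followed by your usual login target.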
Issue with the Intel License server
At this moment there is an issue with the Intel license server. You will be unable to use the icc compiler and Intel tools until this issue is resolved.
UPPMAX support low on staff Monday 20/11
UPPMAX support will be low on staff on Monday 2017-11-20 due to a conference.
How to get a high job priority on Bianca
Support ticket system temporarily down -- FIXED
Our support email address firstname.lastname@example.org was down for a couple of hours, but is back in service again.
Logging in to Bianca without Rackham
Bianca users outside of SUNET will be unable to log in using rackham.uppmax.uu.se. We have created a temporary workaround.
Rackham unavailable -- SOLVED: Rackham available
2017-11-17 09:35 Rackham is now back in regular service.
Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.
It was decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves broken. The problems now seem to be solved, and we're awaiting results from the last tests before Rackham is fully back in service.
UPPMAX power outage -- FIXED
UPPMAX experienced a power outage in the server hall on Tuesday.
Problems with /sw on Bianca (now fixed)
The /sw part of Bianca was lost around 07:30 this morning due to an issue with the storage system. This may have caused jobs to fail. The system was fixed at 08:40.
Quick upgrade of Slurm 2017-11-02 -- COMPLETED
Maintenance window Wednesday 2017-11-01 -- COMPLETED
Monthly maintenance window begins at 09:00 hours on the first Wednesday of the month. (That is today.)
Issues with /sw/data during the weekend
/sw/data from pica may have been unavailable for some jobs during the weekend, and some jobs may have failed because of this.
UPPMAX support system is down -- SOLVED
RT, the support system used by UPPMAX and the rest of SNIC, is down.
It is located at NSC at Linköping University and the whole university has network problems.
All email to and from email@example.com will be delayed until the network problem is fixed, so answers to your support tickets will be delayed as well.
We now have contact with our support system and emails to firstname.lastname@example.org are reaching us again.
Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.
We found the guilty jobs and are terminating them, and we have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We are sorry to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were cancelled and will show up with status NODE_FAIL. The accident occurred while we were investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that can not start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power because an automatic fuse shut down.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an ethernet switch died, possibly short-circuited, so the automatic fuse shut down, taking more switches and the compute nodes down with it.
We have reported the error to our support vendor. Until the bad ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE