Rackham is now open for all users
2017-03-08
All active Tintin projects (except Tintin-Fysast1 projects; please see below) have been migrated to Rackham. All UPPMAX users should now have access to Rackham.
Dear former Tintin user,
UPPMAX has now migrated all active Tintin projects to Rackham, except
for the three combined Tintin-Fysast1 projects. (You should already
know if you belong to one of them.)
Fysast1 projects have not moved to Rackham, because there is no
common storage system for Fysast1 and Rackham. (If needed, please
apply for a project on Rackham.)
If you belong to an active Tintin project, it has now been converted
into a Rackham project, and as a project member you should be able to
log in to Rackham.
If you have any questions about Rackham, you are very welcome to
contact us at the same e-mail address as before, "support@uppmax.uu.se".
For a start: There are some differences between Tintin and Rackham,
as described below.
COMPUTE NODES
Each compute node on Rackham contains 20 compute cores, instead of the
16 compute cores on Tintin. This means that you have to rethink how
many nodes and cores you want to allocate in your jobs.
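As a rough illustration of the arithmetic (a sketch only, not an UPPMAX tool; the function name nodes_needed is made up for this example), the following shell function rounds a core count up to whole nodes:

```shell
#!/bin/bash
# nodes_needed CORES [CORES_PER_NODE]
# Prints the smallest number of whole nodes that provides CORES cores.
# CORES_PER_NODE defaults to 20 (Rackham); pass 16 for the old Tintin layout.
# The function name is hypothetical, chosen only for this illustration.
nodes_needed() {
    local cores=$1
    local per_node=${2:-20}
    # Integer ceiling division: round up to whole nodes.
    echo $(( (cores + per_node - 1) / per_node ))
}

# A job that filled four 16-core Tintin nodes used 64 cores.
nodes_needed 64 16   # Tintin layout
nodes_needed 64      # Rackham layout
```

Both calls print 4, but four Rackham nodes provide 80 cores, so a job that still asks for 64 cores on four whole nodes leaves 16 cores idle. Asking for 60 or 80 cores would fill three or four nodes exactly.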
PROJECT DIRECTORIES
UPPMAX has moved the project directories from Tintin (storage system Pica)
to Rackham (storage system Crex).
The project directory on Tintin was divided into two parts, one backed-up
part, and a no-backup part. Each of these parts had a file space limit
of 512 GB.
On Rackham these two parts have been merged into one, which must not
exceed 128 GB. It is backed up, except for the subdirectory named
"nobackup".
For project directories that exceed 128 GB in size, we are giving you
one month from now to reduce your usage to below 128 GB.
The uquota command will give you information about your usage and
the limit.
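If you need to find out what is taking up the space, standard tools such as du can help. The following is a minimal sketch, not an UPPMAX-provided command; the function name largest_entries and the example path are placeholders:

```shell
#!/bin/bash
# largest_entries DIR [COUNT]
# Lists the COUNT (default 10) largest entries directly under DIR,
# with sizes in KiB, largest first. Hidden (dot) entries are not
# matched by the * pattern and are therefore not listed.
# The function name is hypothetical, chosen only for this illustration.
largest_entries() {
    local dir=$1
    local count=${2:-10}
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Hypothetical usage; replace the path with your own project directory:
# largest_entries /path/to/your/project 5
```

Note that du can take a long time and put load on the file system for directories with many files, so run it sparingly.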
If your project needs more than 128 GB, you may apply for a storage
project that gives an additional storage directory, which is not
backed up. You apply in SUPR, in the round UPPMAX Storage 2017:
https://supr.snic.se/round/2017uppmaxstorage/
LOGIN NODES
As with Tintin, you log in with ssh to one of the four login nodes,
which share the common name "rackham.uppmax.uu.se". So, please open a
terminal and run
ssh username@rackham.uppmax.uu.se
If you want to run graphical applications you must specify -X or -Y,
e.g.
ssh -X username@rackham.uppmax.uu.se
YOUR PROJECTS ON RACKHAM
As usual, you get a list of your projects with the projinfo command.
SOFTWARE
Please note that most software that you may have compiled on Tintin
should be recompiled on Rackham, because of the more modern computer
architecture.
For a complete list of currently installed software, please run the
following after logging in:
module avail
You can search for modules with the "module spider"
command:
module spider name-of-software
The list of available software will be updated in the coming weeks. At
this time most of the compilers (icc, mpicc, gcc, gfortran and javac),
interpreters (Python, Perl, R) and applications (MATLAB, GAUSSIAN,
COMSOL, RStudio, OpenFOAM, GROMACS) are installed.
If you are missing software and are unable to install it yourself, you
may ask for support at support@uppmax.uu.se.
SOME MORE INFORMATION
You are also welcome to read the web page
http://uppmax.uu.se/resurser/systems/the-rackham-cluster
Best wishes,
-- UPPMAX Support Team <support@uppmax.uu.se>
SNIC-UPPMAX, Uppsala University
Home page: http://www.uppmax.uu.se
Twitter: https://twitter.com/uppmax
Old System News
-
milou2 rebooted August 28
milou2 rebooted on Monday 2017-08-28 at 19:51
-
milou2 rebooted August 19
milou2 rebooted on Saturday 2017-08-19.
-
Intelmpi
Intelmpi performance issues
-
Issues with X11 on milou (X11Forwarding) -- SOLVED
We have observed, and several users have reported, issues with running X11 applications on Milou. We are investigating it.
-
milou2 and milou-b rebooted
The login nodes milou2.uppmax.uu.se and milou-b.uppmax.uu.se were rebooted 15:00 today (29th of May) due to some issues with the kernel NFS module.
-
Cooling stop at 17.00 hours the 23rd of May -- CANCELLED
-
Issues with certain project volumes for milou/pica 20170515 and onwards.
Some project volumes on pica are very heavily loaded and slow, next to unusable for interactive use. We are doing what we can to resolve this but cannot promise any set time for when things will behave as normal again.
UPDATE: We have had continuing issues with this because some nodes do not notice when resources behave better. We are working on these issues, but this may have caused disturbances such as failed jobs or missing output.
-
Support may be slow May 11th and 12th due to conference
The UPPMAX system group hosts the spring 'SONC' conference, where administrators from all SNIC centers meet and discuss how to improve our centers. With many UPPMAX administrators being out of office during the conference (Thursday 11th and Friday 12th), support will likely be less responsive.
-
slurm disturbance on milou 2017-05-10
Due to a misconfiguration active on a certain number of nodes around 12AM today, some jobs that were launched on milou could not start.
If you have jobs that were victims of this, they will likely show up as completed although with a very short run time (a few seconds).
-
Disturbances in Slurm today Tuesday -- finished
-
Maintenance window Wednesday 2017-05-03 -- finished
-
Slurm problems on Rackham -- fixed
-
Intel license server not responding -- fixed
We have gotten reports that the Intel license server is not responding. We are investigating it. This might manifest itself with hangs or freezes during compilations.
-
Problem "Invalid account or account/partition..." -- solved
We have identified a problem with the Slurm account database. If you were just added to a project or created a new one, you might get the following message when scheduling jobs: "Invalid account or account/partition...". It primarily affects Rackham and Milou.
-
Problem with Slurm on Milou -- fixed
-
Interrupts in Slurm service on Rackham -- fixed
-
Bianca's storage system Castor has problems -- fixed
-
Resetting your password from the homepage is not working -- fixed
Resetting your password from this page is currently not working. If you need to reset your password, please contact support@uppmax.uu.se.
Update 2017-04-18: This issue should now be fixed.
-
Funk-accounts and new certificates
Some of the shared funk-accounts used on Irma and Milou might stop working due to the IP-address change.
-
Maintenance window Wednesday 2017-04-05 -- finished
-
Smog will be decommissioned on Wednesday 5th of April
Smog will be decommissioned on Wednesday 5th of April. As previously mentioned, the SNIC Cloud Team is currently working on bringing up a new cloud to replace Smog and join the other two regions in the SNIC Science Cloud project.
For questions, please contact support@cloud.snic.se (and not the UPPMAX support queues).
-
Rackham2, one of Rackham's login nodes, got into problems -- now fixed
-
Maintenance window for Bianca Wednesday 2017-03-22 -- finished
-
Problem with file permissions in certain projects
-
Poor performance using Intel MPI on Rackham
We have identified performance issues when using Intel MPI on Rackham. In some cases you see a 10x slowdown (or worse) using Intel MPI compared to Open MPI. We are investigating this issue and hope to have it solved soon. For now, please use Open MPI.
-
Fixed: "Project p123456 may not run jobs on this cluster (rackham)"
An issue exists on Rackham affecting projects of the form "p123456". The projects are not allowed to run jobs because the monthly core allocation is incorrectly set to 0 hours. We are investigating why this happens.
Update 2017-03-10: The issue should now be fixed.
-
Rackham is now open for all users
All active Tintin projects (except Tintin-Fysast1 projects) have been migrated to Rackham. All UPPMAX users should now have access to Rackham.
-
Rackham will soon be open for all users
Many Tintin users have missed that Rackham will replace Tintin. We are currently migrating all projects from Tintin to Rackham, and when this is done, all users will get access to Rackham. We will announce this by email and on our homepage.
-
Maintenance window Wednesday 2017-03-01 -- finished
-
Today we decommission Tintin
1st of March 2017 is the day we decommission Tintin. It will be replaced by the Rackham cluster. All projects on Tintin will be moved to our new Rackham cluster.
-
Creation of new UPPMAX user accounts will be delayed
-
Delayed approval of Account Requests -- fixed
We have identified a problem with the UPPMAX Account Request which unfortunately causes some delay before you can log in to UPPMAX. We hope to complete the registration this week. You do not have to resubmit your account request.
-
Problem sending in support tickets using support@uppmax.uu.se -- fixed
There is currently a problem sending in support tickets to support@uppmax.uu.se. We are investigating and hope to have it fixed soon.
-
Rackham is now available
We are happy to announce that UPPMAX's cluster Rackham is now available!
-
Downtime due to power outage
Milou, Tintin, and Fysast-1 are back in production. Bianca is back in test production. Still working on Smog.
-
Milou2 now back again
The degraded RAID is now fixed.
-
Milou-f rebooted Tuesday afternoon
Lustre file system problem
-
Milou1 rebooted (Tuesday 14:00)
Totally unresponsive. Lustre file system problem (the file system will be decommissioned tomorrow).
-
Milou1 rebooted -- now with limited number of inodes on /scratch (/tmp)
We now have a quota on the number of files in /scratch (/tmp).
The maximum is 100000 files per user. If you need more, you have to use a compute node.
-
Gulo (including glob directory) decommissioned January 18
-
Milou2 down for reinstallation 13:50 (now waiting for spare parts)
Milou2 hasn't worked well for a while. We will give it a fresh restart.
-
Milou1 rebooted Thursday at 11:00
Milou1 rebooted Thursday at 11:00.
Problems with the Lustre file system.
-
Fysast1 down Wednesday
Fysast1 down Wednesday before lunch.
One power supply broke and the fuses for half the cluster were blown.
-
Milou1 rebooted Wednesday at 11:00
Milou1 rebooted Wednesday at 11:00.
Problems with the Lustre file system.
-
Maintenance window on Irma Wednesday 2017-01-11 -- FINISHED
-
Maintenance window on Mosler/topolino Wednesday Jan 11 -- FINISHED
We have a maintenance window coming up on January 11 from 9:00. Due to physical work, we need to shut down the system during the maintenance window this time so jobs won't run.
We will also likely be required to rebuild virtual nodes and will probably lose information about queued jobs.
Update 21:10: Maintenance is now finished and the system should be available again.
-
Poor performance on Milou and Tintin
-
Maintenance window Wednesday 2017-01-04 -- FINISHED
-
Milou2 rebooted Friday morning
Milou2 rebooted at 06:01 due to a problem with the Lustre file system.