Uppsala Multidisciplinary Center for Advanced Computational Science

Bianca user guide

0. PREREQUISITES

In order to access Bianca you need to be a member of a SNIC SENS research project (these are called sensNNNNN, where each N represents a digit). SUPR will tell you if you are a member, and the account page should list your account on the resource bianca (this is also useful for verifying your username).

Additionally, you must have accepted the SNIC Common User Agreement in SUPR. You can check whether you have done this by visiting your Personal Information page in SUPR and scrolling to the bottom. This is a common cause of problems logging in.

Once you are set up for login, this should also be reflected in SUPR through one or several additional account(s) at UPPMAX for the specific project(s) you are a member of.

1. Set up TWO factor authentication

Follow the instructions in Setting up two factor authentication.

Please note that you need to set up two factor authentication for UPPMAX, not for SUPR! If you have multiple two factor codes registered, the correct one is labeled UPPMAX.

2. Login

Bianca is accessible from all SUNET IP addresses, i.e. all Swedish university networks. It is generally NOT accessible from other networks. 

The login procedure requires you to pass two separate authentication mechanisms (automatically connected together). The first one logs you into the general Bianca login node (which we call the jumphost). This is the step that requires two factor authentication. You will then be automatically redirected to your project's private login node, where you will get a new password prompt (unless you set up ssh-keys).

The user name you will use in the first step is your ordinary UPPMAX user name, followed by the project ID of the project you want to work on. One of the security measures on Bianca is that all projects are kept separate on their own virtual clusters, so you must tell Bianca which project's cluster you want to connect to.

Primary login (text login)

You can use any ssh program on all common platforms (Windows, Mac, Linux).

$ ssh -A <username>-<projid>@bianca.uppmax.uu.se
Ex.
$ ssh -A myname-sens2016999@bianca.uppmax.uu.se

As password you use your normal UPPMAX password directly followed by the six digits from the second factor application from step 1.

E.g. if your password is "VerySecret" and the second factor code is 123 456 you would type VerySecret123456 as the password in this step.

If the password is correct you will get a message about your project's login node status: it can be "up and running" or "down". If it is down, it will automatically spin up, but this takes a few minutes. You will then be automatically redirected to log in at your project's private login node. There you will have to give your UPPMAX password once again, but without the two factor authentication code this time. If your password is "VerySecret", you would type VerySecret as the password in this step. To skip this password in the future, you can use ssh-keys.
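
As a sketch of the ssh-key route (the key file name and comment are just examples, and for the demonstration the key is created in a temporary directory):

```shell
# demo: generate a key pair in a temporary directory; in real use you
# would keep it under ~/.ssh/ and protect it with a passphrase
tmp=$(mktemp -d)
ssh-keygen -q -t ed25519 -N "" -f "$tmp/id_bianca" -C "bianca key (example)"

# the .pub line is what you append to ~/.ssh/authorized_keys
# on your project's private login node
cat "$tmp/id_bianca.pub"
```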

If the passwords have been entered correctly, you are now connected and will see the computer name at the start of the command line, which looks something like this:

[myuser@sens2016999-bianca ~]$

This text login is limited to 50 kbit/s, so if you create a lot of text output you will have to wait some time before you get your prompt back. The system supports all cut and paste mechanisms your client computer supports, but of course you are not supposed to transfer any real data in or out through this mechanism, only commands and similar text.

Graphical login

Bianca does not support X-forwarding, so to use graphical applications you need a full graphical desktop login. For this, Bianca uses ThinLinc (in web access mode only) with XFCE as the desktop environment. All you should need is a reasonably modern browser on any platform; we have tested Chrome and Firefox.

Just browse to: https://bianca.uppmax.uu.se

The login follows the same principle as the text login: in the first step, use user-project as the username, and your password followed by the six digit second factor code as the password. In the second step, log in to your specific project's login node with your normal username (here you really have to type it; in text mode this is done automatically) and your normal UPPMAX password (without the second factor). The redirection to the correct project login node happens automatically. If the login node is sleeping, you will be told what to do.

When you are logged into your graphical environment, the resizing and even fullscreen mode (in your browser) should work as expected.

Under the hidden tab at the left edge of the screen you can find a clipboard, some special keys (whose keyboard shortcuts often interfere with your local system) and the "disconnect" button. It is important to understand the difference between "disconnect session" and "end session". When you disconnect a session, you will come back to exactly the same place you left: if you were, for example, editing a file, your prompt will be in the same place at the next login. If you use "logout" (end session) in the XFCE menus, the system will try to close all your windows and files and end the processes related to the login.

Bianca has an autodisconnect after 30 minutes of inactivity, and in the future we may implement some kind of automatic logout from active graphical sessions.

3. Transfer files to and from Bianca

For security reasons, Bianca has no internet access, so you cannot download or upload files directly on the cluster. All files must be transferred through Bianca's wharf area. The wharf has access to one of the folders in your project folder, but nothing else outside of it. The path to this folder, once you are logged into your project's cluster, is:

/proj/<projid>/nobackup/wharf/<username>/<username>-<projid>
E.g.
/proj/sens2016999/nobackup/wharf/myuser/myuser-sens2016999
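
The path always follows the same pattern, so a small shell sketch (with hypothetical username and project ID) can build it for you:

```shell
# hypothetical values; substitute your own username and project ID
USERNAME=myuser
PROJID=sens2016999

# the wharf directory follows a fixed pattern
WHARF="/proj/$PROJID/nobackup/wharf/$USERNAME/$USERNAME-$PROJID"
echo "$WHARF"
```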

To reach this catalog from the "outside" you need to use sftp.

To avoid subtle transfer errors, please make sure you have write permissions for "owner" on the source files and directories.

Using standard sftp client

$ sftp <username>-<projid>@bianca-sftp.uppmax.uu.se
Ex.
$ sftp myuser-sens2016999@bianca-sftp.uppmax.uu.se

Notice the different host name!

As password you use your normal UPPMAX password directly followed by
the six digits from the second factor application from step 1.

Ex. if your password is "VerySecret" and the second factor code is 123 456 you would type VerySecret123456 as the password in this step.

Once connected you will have to type the sftp commands to upload/download files. Have a look at the Basic SFTP commands guide to get started with it.

Please note that in the wharf you only have access to upload your files to the directory that is named:

<username>-<projid>
e.g.
myuser-sens2016999

so the first thing you will want to do is cd to that directory.

sftp> cd myuser-sens2016999

Alternatively, you can specify this at the end of the sftp command, so that you will always end up in the correct folder directly.

$ sftp <username>-<projid>@bianca-sftp.uppmax.uu.se:<username>-<projid>
E.g.
$ sftp myuser-sens2016999@bianca-sftp.uppmax.uu.se:myuser-sens2016999

sftp supports a recursive flag (put -r), but it is very sensitive to the combination of sftp server and client, so be warned. Later in this guide you can find a rough solution for bulk transfers.

Some other sftp client

Please note that sftp is NOT the same as scp, so be sure to really use an sftp client, not just an scp client.

Also be aware that many sftp clients reconnect automatically (with a cached version of your password). This will not work on Bianca, because of the second factor! Some clients also try to use multiple connections with the same password, which will fail.

So for example with lftp, you need to "set net:connection_limit 1". lftp may also defer the actual connection until it's really required unless you end your connect URL with a path.

An example command line for lftp would be

lftp sftp://<username>-<projname>@bianca-sftp.uppmax.uu.se/<username>-<projname>/

Mounting the sftp-server with sshfs

This is only possible on your own system; for security reasons, UPPMAX does not have the sshfs client package installed. sshfs exists on Linux (package sshfs in Ubuntu, and fuse-sshfs, from EPEL, on RedHat-like systems), as well as on Windows and Mac OS X. When you have sshfs installed, you can mount the wharf. Here is an example with a Linux client (as a normal user):

mkdir ~/wharf_mnt

sshfs <username>-<projname>@bianca-sftp.uppmax.uu.se:<username>-<projname> ~/wharf_mnt

After that you can use wharf_mnt exactly as a local directory, for example with rsync.

To unmount it do:

fusermount -u ~/wharf_mnt

Bulk recursive transfer with only standard sftp client

Directory structures that contain symbolic links are rather common among the data you may want to transfer. Here is a very simple solution to copy everything in a specific folder (following symbolic links) to the wharf.

==============
~/sftp-upload.sh
==============
#!/bin/bash
# sftp-upload.sh - print an sftp batch script that recreates the given
# directories and uploads all files (and symbolic link targets) in them
find "$@" -type d | awk '{print "mkdir",$0}'
find "$@" -type f | awk '{print "put",$0,$0}'
find "$@" -type l | awk '{print "put",$0,$0}'
-----------

With this script you can do:

cd /home/myuser/glob/testing/nobackup/somedata
~/sftp-upload.sh *|sftp -oBatchMode=no -b- <username>-<projid>@bianca-sftp.uppmax.uu.se:<username>-<projid>

The "-b -" flag makes sftp read its commands from standard input (here, the output of the script) and abort on the first error.
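
To see what the script feeds to sftp, you can run the same find/awk pipelines on a small example tree (the names here are made up):

```shell
cd "$(mktemp -d)"            # scratch area for the demonstration
mkdir -p demo/sub
echo hi > demo/file.txt

# the same pipelines as in sftp-upload.sh, printing the batch commands
find demo -type d | awk '{print "mkdir",$0}'
find demo -type f | awk '{print "put",$0,$0}'
```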

4. Accounts

Note that the machine you arrive at when logged in is only a so-called login node, where you can do various smaller tasks, like submitting Slurm jobs and examining their result files. For larger tasks you must use our batch system to run your tasks on other machines within the cluster.

To allow fair and efficient usage of the system we use the Slurm resource manager to coordinate user demands. Read our Slurm user guide for detailed information on how to use Slurm. Please note that Slurm behaviour on Bianca is somewhat different in the following areas:

  • Each Bianca project has an independent Slurm installation of its own, so when you run commands like jobinfo (or squeue), you will see only jobs belonging to your project cluster.
  • On Bianca there are in total a little more than 190 compute nodes. Most of them are distributed among the current project clusters, according to earlier demand and priority.
  • With the command bianca_combined_jobinfo you can see a static view of a recent combined job queue, containing the jobs of all project clusters. This may give you some idea of the relative position of your queued jobs among those of all project clusters.
  • Your queued jobs are normally started on the current compute nodes of your project cluster. If your project cluster currently has no compute nodes, or if there are free compute nodes elsewhere, some additional compute nodes may be moved to your project cluster. Below you can read a short description of how this is done.
  • There are no special devcore or devel nodes on your project clusters. On Bianca, they are the same nodes as normal core and node nodes, and are handled in the same way. So, currently devcore and devel jobs start as quickly or slowly as normal jobs. If you want them to start quicker, please use the --qos=interact or --qos=short flag to the sbatch or interactive command, to give them a higher priority.
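
As a sketch, a minimal Bianca batch script using the --qos=short flag might look like this (the project ID, time limit, and job body are hypothetical; the #SBATCH lines are read by Slurm but are ordinary comments to bash):

```shell
#!/bin/bash
#SBATCH -A sens2016999        # project ID (hypothetical)
#SBATCH -p core               # "core" partition, see section below
#SBATCH -n 1                  # one core
#SBATCH -t 00:15:00           # 15 minutes walltime
#SBATCH --qos=short           # higher priority for short test jobs

# $SLURM_JOB_ID is set by Slurm inside a job; fall back when run locally
echo "hello from job ${SLURM_JOB_ID:-unknown}"
```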

How compute nodes are moved between project clusters

The total job queue, made by combining the job queues of all project clusters, is monitored and acted upon by an external program named the meta-scheduler.

In short, this program repeats the following procedure over and over again:

  1. Finds out where all the compute nodes are: on a specific project cluster or yet unallocated.
  2. Reads status reports from all project clusters, about all their jobs, all their compute nodes, and all their active users.
  3. Checks whether there are unallocated compute nodes for all queued jobs.
  4. If not, tries to "steal" nodes from project clusters to get more unallocated compute nodes. This "stealing" is done in two steps: a/ "drain" a certain node, i.e. disallow more jobs to start on it; b/ remove the compute node from the project cluster once no jobs are running on the node.
  5. Uses all unallocated nodes to create new compute nodes on the project clusters. Jobs with a higher priority get compute nodes first.

In total, the new compute node will probably be active within 10 to 30 minutes.

If there are high-priority jobs (jobs with a priority of 212000 or higher) in the queue, they get special handling to make them start faster, normally within one to 25 minutes.

Some Limits

  • There is a job walltime limit of ten days (240 hours).
  • We restrict each user to at most 5000 running and waiting jobs in total.
  • Each project has a 30 day running allocation of CPU hours. We do not forbid running jobs after the allocation is overdrawn; instead we allow you to submit jobs with a very low queue priority, so you may still be able to run your jobs if a sufficient number of nodes happens to be free on the system.

Convenience Variables

  • $SNIC_TMP - Path to node-local temporary disk space

    The $SNIC_TMP variable contains the path to a node-local temporary file directory that you can use when running your jobs, in order to get maximum disk performance (since the disks are local to the current compute node). This directory is automatically created on your (first) compute node before the job starts and automatically deleted when the job has finished.

    The path specified in $SNIC_TMP is /scratch/$SLURM_JOB_ID, where the job variable $SLURM_JOB_ID contains the unique job identifier of your job.

    WARNING: Please note that in "core" jobs (see below), if you write data in the /scratch directory but outside of the /scratch/$SLURM_JOB_ID directory, your data may be automatically deleted while your job runs.
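
A typical pattern is to copy input to $SNIC_TMP, work there, and copy results back to project storage before the job ends. A runnable sketch (file names are made up; outside a job it falls back to a temporary directory so you can try it locally):

```shell
#!/bin/bash
# inside a job, $SNIC_TMP points at /scratch/$SLURM_JOB_ID;
# outside a job we fall back to a temporary directory for the demo
SCRATCH="${SNIC_TMP:-$(mktemp -d)}"

echo "some input data" > "$SCRATCH/input.txt"     # stand-in for real input
tr a-z A-Z < "$SCRATCH/input.txt" > "$SCRATCH/result.txt"

# copy results back (here: just show them) before the job ends,
# because $SNIC_TMP is deleted automatically afterwards
cat "$SCRATCH/result.txt"
```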
 

Details about the "core" and "node" partitions

A normal Bianca compute node contains about 112 GB of RAM and sixteen compute cores. An equal share of RAM for each core means that each core gets at most 7 GB of RAM. This simple calculation gives one of the limits mentioned below for a "core" job.

You need to choose between running a "core" job or a "node" job. A "core" job must keep within certain limits, to be able to run together with up to fifteen other "core" jobs on a shared node. A job that cannot keep within those limits must run as a "node" job.

Some serial jobs must run as "node" jobs. You tell Slurm that you need a "node" job with the flag "-p node". (If you forget to tell Slurm, you are by default choosing to run a "core" job.)

A "core" job:

  • Will use a part of the resources on a node, from a 1/16 share to a 16/16 share of a node.

  • Must specify a number of cores between 1 and 16, i.e. between "-n 1" and "-n 16".

  • Must not demand "-N", "--nodes", or "--exclusive".

  • Is recommended not to demand "--mem"

  • Must not demand to run on a fat node (see below, for an explanation of "fat").

  • Must not use more than 7 GB of RAM for each core it demands. If a job needs half of the RAM, i.e. 56 GB, you also need to reserve at least half of the cores on the node, i.e. 8 cores, with the "-n 8" flag.

A "core" job is accounted on your project as one "core hour" (sometimes also called a "CPU hour") per allocated core, for each wallclock hour it runs. A "node" job, on the other hand, is accounted as sixteen core hours for each wallclock hour it runs, multiplied by the number of nodes you have asked for.
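
The accounting and RAM rules above amount to simple arithmetic; with hypothetical job sizes:

```shell
# core job: allocated cores x wallclock hours
cores=4; hours=10
echo "core job: $((cores * hours)) core hours"

# node job: 16 x nodes x wallclock hours
nodes=2; hours=3
echo "node job: $((16 * nodes * hours)) core hours"

# RAM rule for "core" jobs: at most 7 GB per requested core,
# so a 56 GB job needs at least ceil(56/7) = 8 cores
mem_gb=56
echo "need at least -n $(( (mem_gb + 6) / 7 ))"
```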

Node types

Bianca has two node types: thin, the typical cluster node with 112 GB of memory, and fat nodes with 256 GB or 512 GB of memory. You can request a node with more RAM by adding "-C fat" to your job submission line, thus making sure that you will get at least 256 GB of RAM on each node in your job.

If you absolutely must have more than 256 GB of RAM, you can request 512 GB of RAM specifically by adding "-C mem512GB" to your job submission line.

Please note that there are only 5 nodes with 256 GB of RAM, and only 2 nodes with 512 GB of RAM.