
How do I get help?

The primary way to get help is by email.

There are two ways to get help by email.

Ask another user

You can sign up to the HPC discussion list (hpc-discuss@lists.bath.ac.uk) for users of the HPC facilities here at Bath:
http://lists.bath.ac.uk/sympa/info/hpc-discuss

Ask the BUCS HPC Support Team

For problems with the HPC Service you can contact us by emailing support-hpc@bath.ac.uk.

FAQ

How do I get access to the cluster?

Each department has a nominated member of staff who can grant access to the cluster. The list of these staff can be found on the Aquila wiki page.

Once you have been added, you should be able to SSH into Aquila as you would any of the other User Service (Unix) machines.

Can I access my BUCS home directories on the cluster?

For practical reasons your home directory on the cluster is not the same as your main BUCS home directory. However, on the master node your BUCS home directory can be found in the same location as on any of the BUCS Unix machines. This location is stored in the shell variable '$BUCSHOME':

ee0mdc:aquila-0:~ $ pwd
/home/ee0mdc
ee0mdc:aquila-0:~ $ echo $BUCSHOME
/u/d/ee0mdc
ee0mdc:aquila-0:~ $ pushd $BUCSHOME
/u/d/ee0mdc ~
ee0mdc:aquila-0:ee0mdc $ pwd
/u/d/ee0mdc
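
As a minimal sketch of how this variable might be used, the function below copies a directory from the BUCS home directory into the cluster home directory so that jobs can read it. The function name and the directory name 'project_input' are hypothetical, and the copy is assumed to be run on the master node, since that is the only place the text above says '$BUCSHOME' is available.

```shell
# Copy a directory from the BUCS home directory into the cluster home.
# Run this on the master node; 'project_input' is a hypothetical name.
stage_from_bucs() {
    src="$BUCSHOME/$1"
    dest="$HOME/$1"
    if [ ! -d "$src" ]; then
        echo "No such directory: $src" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # The trailing /. copies the contents of src, including dotfiles.
    cp -Rp "$src/." "$dest/"
}

# Example: stage_from_bucs project_input
```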

How do I cite the cluster?

We weren't sure about this, so we asked Prof James Davenport the question "How should I cite my/our use of Aquila in a published paper?"

He suggests the following:

"These computations were performed on the University of Bath's High Performance Computing Facility."

If the time was funded by an external agency, one might add ", time on which was funded by the Pragmatric Philosophy and Astrology Research Council under grant GR/C/666".

He hopes soon to have (and will add here) a short page describing the machine suitable for inclusion (as a URL). If one wants a description in the meantime, his suggestion is this: "An 800-core Intel Xeon E5430 2.66GHz machine at 2GB/core of 667MHz memory, with DDR Infiniband interconnect".

What software (numerical libraries, compilers, etc.) is available for use?

Compilers

  • Intel Compiler Suite; versions 10.1.015 and 11.1/046
  • GNU Compiler Collection; versions 4.1.2, 4.3.4 and 4.4.4.
  • Open64 compiler; version 4.2.2.2

Precompiled Libraries

These are always expanding, so it is best to check on the machine itself. The command module avail will give you a list of what is currently available.

  • LAPACK 3.2.1
  • ScaLAPACK 1.8.0
  • BLAS 1
    • ACML 4.3.0
    • Intel MKL 10.2.1.017
    • ATLAS 3.9.23
    • GotoBLAS 1.25
    • GotoBLAS2 1.13
  • BLACS 1.1 patch03
  • bonnie++ 1.96
  • FFT
    • FFTW2 2.1.5
    • FFTW3 3.2.2
  • globalarrays 4.2
  • hdf5 1.6.9
  • MPI
  • NAG Fortran compiler
  • NAG F77 library
  • NAG F90 library
  • NAG parallel library
  • NAG SMP library
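
The following sketch shows one way to find and load a library with the module command. 'module avail' is the command named above; 'module load' and 'module list' are the usual companions in the Environment Modules system, which is assumed to be what the cluster runs. The function name and the module name 'fftw' are hypothetical, so use a name exactly as it appears in the listing.

```shell
# Search the module listing for a name, load it, then show what is loaded.
# The module name passed in is a hypothetical example.
load_library() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; run this on the cluster" >&2
        return 1
    fi
    # module usually prints its listing on stderr, hence 2>&1.
    module avail 2>&1 | grep -i "$1" || echo "no module matching '$1' in the listing"
    module load "$1" || return 1
    module list
}

# Example: load_library fftw
```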

Applications

How much is this going to cost?

Costing is straightforward - pFact now knows about it (MRF-Computing), and it is about 50p per node (all 8 cores) per hour. Users are expected to apply for funding on this basis.

A node has 2 x 4-core 2.66GHz processors with 16GB of memory.
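
As a rough worked example of the rate above, the function below estimates the charge for a job in pence. It assumes that whole nodes are charged, so a core count is rounded up to a multiple of 8; that rounding rule and the function name are assumptions, and the 50p figure is itself approximate.

```shell
# Estimate the charge for a job at about 50p per node per hour.
# Assumption: whole nodes are charged, so round cores up to nodes.
job_cost_pence() {
    cores=$1
    hours=$2
    nodes=$(( (cores + 7) / 8 ))    # ceiling division by 8 cores per node
    echo $(( nodes * hours * 50 ))
}

# 32 cores for 12 hours: 4 nodes x 12 h x 50p
job_cost_pence 32 12    # prints 2400, i.e. about 24 pounds
```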

How do I run my program on the cluster?

Access to the nodes on the cluster is controlled by a queuing system. To run your code on the cluster you need to write a job script which tells the queuing system what resources you wish to use and how to run your program.

More information about this queuing system can be found in the HPC Queuing section of this wiki.

In its simplest form:

msub my_job
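
A job script is an ordinary shell script with the resource requests written in comment lines. The following is a minimal sketch rather than a recommended template: it assumes Torque-style #PBS directives (Torque is the resource manager named in the Scheduling section of this page) and a hypothetical program called my_program. The job name, node count and walltime are placeholder values.

```shell
#!/bin/sh
# Hypothetical job script 'my_job'; submit it with: msub my_job
# The #PBS lines are resource requests read when the job is submitted.
#PBS -N my_job
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00

# Start in the directory the job was submitted from (set by Torque),
# falling back to the current directory when the script is run by hand.
cd "${PBS_O_WORKDIR:-.}" || exit 1

echo "Job started on $(uname -n)"
# Replace the next line with the command that runs your program.
# ./my_program
echo "Job finished"
```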

Why isn't my job just running?

If you think that an idle job should have started, you can use the checkjob command to see why the scheduler hasn't started it. Often the job is waiting while the scheduler tries to free enough nodes for a higher-priority parallel job to run. If your job is not at the top of the queue and won't start even though there seem to be free nodes, this is almost always the reason.

This behaviour is caused by reservation and backfill, a system that allows mixed serial and parallel jobs to run while keeping utilisation reasonably high. Our scheduler takes the queued job with the highest priority and works out the earliest time it could run, assuming all running jobs use their full walltime allocation (it is not intelligent enough to do otherwise). It then reserves the set of nodes the job will run on from that calculated start time. Nothing that could delay the start of the top job will be allowed to start. This means that if a big parallel job is waiting and the system has some free nodes, but not enough to let it run, then no small jobs will be allowed to start on the free nodes unless their walltime limit is small enough that they will finish before the top job's calculated start time. Otherwise the top job could be delayed by a stream of smaller jobs.

If you want to run a small, short job in this situation, use the showbf command to see the largest walltime limit you can use without delaying the waiting large job. If you set a limit just under that time then your job will get backfilled onto the free nodes.

[cs1jrj@aquila ~]$ showbf
backfill window (user: 'cs1jrj' group: 'aquila-users' partition: ALL) Mon Sep 14 16:56:20

760 procs available for      14:03:40



[cs1jrj@aquila ~]$

The scheduler recalculates the top job and the reservations regularly, so it's possible for things to change if jobs finish earlier than expected or as fairshare changes the priority of the queued jobs.
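
To illustrate "just under that time", the sketch below takes the window printed by showbf and subtracts a safety margin to give a walltime request. The function name and the five-minute default margin are arbitrary assumptions, and only the HH:MM:SS form shown in the output above is handled.

```shell
# Turn a showbf window such as 14:03:40 into a slightly shorter
# walltime request. Only the HH:MM:SS form is handled.
walltime_under_window() {
    window=$1             # HH:MM:SS as printed by showbf
    margin=${2:-300}      # seconds to stay below the window (assumed)
    h=${window%%:*}
    rest=${window#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so that 08 and 09 are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    total=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} - margin ))
    if [ "$total" -le 0 ]; then
        echo "window too short for a margin of $margin seconds" >&2
        return 1
    fi
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

walltime_under_window 14:03:40    # prints 13:58:40
```

The result can then be passed as the walltime limit when submitting, for example with msub -l walltime=13:58:40 my_job.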

Error: "p4_error: semget failed for setnum: 0"

This is usually caused by the untimely demise of a previous job you ran on the same nodes. As it stands you will be unable to fix this yourself. Please contact support-hpc@bath.ac.uk and we will sort it out for you.

Error: "mpiexec_node100: cannot connect to local mpd (/tmp/mpd2.console_username);"

If you see this error, it means that you did not properly start the 'mpd' daemon before using 'mpirun' with either MPICH2 or MVAPICH2. The 'mpd' daemon must first be started using the 'mpdboot' command, and shut down at the end with 'mpdallexit', as in the example job scripts on the MPICH2 and MVAPICH2 pages.

  • There is a separate MPI FAQ which you may also like to look at.
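
The start, run and shut down sequence described above can be sketched as a shell function for use inside a job script. This is an illustration under assumptions, not a copy of the example scripts on the MPICH2 and MVAPICH2 pages: the function name and './my_mpi_program' are hypothetical, and $PBS_NODEFILE is assumed to be the node list that Torque provides to the job.

```shell
# Start the mpd ring, run the program, and shut the ring down again
# even if the program fails.
run_with_mpd() {
    hosts=$(mktemp) || return 1
    # One line per distinct node; the node file lists a node once per core.
    sort -u "$PBS_NODEFILE" > "$hosts"
    nnodes=$(( $(wc -l < "$hosts") ))
    nprocs=$(( $(wc -l < "$PBS_NODEFILE") ))

    mpdboot -n "$nnodes" -f "$hosts" || { rm -f "$hosts"; return 1; }
    status=0
    mpirun -np "$nprocs" "$@" || status=$?
    mpdallexit
    rm -f "$hosts"
    return "$status"
}

# Inside a job script: run_with_mpd ./my_mpi_program
```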

Can I use OpenMP on the cluster?

Yes, the appropriate libraries have been installed on the cluster for you to use OpenMP. However, please note that by default an OpenMP program will only use one node on the cluster. This means that if you want to run multiple programs, each should be submitted as a separate job with its own job script.
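
A job script for an OpenMP program might look like the sketch below. It assumes Torque-style #PBS directives, sets the standard OMP_NUM_THREADS variable to the 8 cores per node mentioned on this page, and uses a hypothetical program name; the walltime is a placeholder.

```shell
# Sketch of an OpenMP job: one node, one thread per core.
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00

# OMP_NUM_THREADS is the standard OpenMP variable for the thread count.
OMP_NUM_THREADS=8
export OMP_NUM_THREADS

echo "Running with $OMP_NUM_THREADS OpenMP threads"
# ./my_openmp_program    # hypothetical executable name
```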

Is there a way to get status information on the nodes in the cluster?

The Ganglia web interface should give you most of the information you need.

Is there local storage available on each node?

You may write to /local on each of the nodes. However, we recommend that most checkpointing is done into your home directory, because you are not guaranteed to get the same nodes each time and /local will be wiped at the end of your run.
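
One way to use /local safely is to run in a per-job directory there and copy everything back to your home directory before the job ends. The sketch below does this; the function name, the SCRATCH_ROOT and RESULTS_DIR variables and the 'results' directory are hypothetical, and $PBS_JOBID and $PBS_O_WORKDIR are assumed to be set by Torque.

```shell
# Run a command in a per-job directory under /local, then copy the
# files it produced to the home directory and remove the scratch copy.
run_in_local_scratch() {
    scratch="${SCRATCH_ROOT:-/local}/${USER:-user}.${PBS_JOBID:-$$}"
    results="${RESULTS_DIR:-$HOME/results}"
    mkdir -p "$scratch" "$results" || return 1

    status=0
    ( cd "$scratch" && "$@" ) || status=$?

    # Copy back whatever was written, even if the command failed.
    cp -Rp "$scratch/." "$results/"
    rm -rf "$scratch"
    return "$status"
}

# In a job script, give the program's full path because the command
# runs from the scratch directory:
# run_in_local_scratch "$PBS_O_WORKDIR/my_program"
```

This only protects results from a run that ends normally, so periodic checkpoints should still go to your home directory as recommended above.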

 

Scheduling

Unable to copy job's output and error log files

Should you receive an email from the scheduler mentioning that it was unable to copy back the stageout files, e.g.

PBS Email

In this particular instance Torque was unable to handle the special "()" characters, so please avoid using special characters in naming your files and directories. Unix special characters are: ! @ $ ^ & * ~ ? . | / [ ] < > \ ` " ; # ( )
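
A quick check before submitting can catch such names. The sketch below accepts only letters, digits, underscore, hyphen and dot. This is a conservative assumption rather than a rule taken from Torque, and it deliberately allows the dot used in file extensions even though it appears in the list above. The function name is hypothetical.

```shell
# Succeed only if a file or directory name is made of letters, digits,
# underscore, hyphen and dot.
is_safe_name() {
    case "$1" in
        ""|*[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

is_safe_name "run_01.out" && echo "run_01.out is safe"
is_safe_name "run(1).out" || echo "run(1).out should be renamed"
```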


3 Comments

  1. "More software (e.g. Ansys and Matlab) is being explored."

     Any news on this? The ability to use common academic applications which would benefit from the HPC cluster (such as Ansys and Matlab) in a simple way is the most common request about HPC I'm picking up from various departments (elec-eng, architecture etc.)

    1. Unknown User (ee0mdc)

      To quote James...

      "Matlab is being actively pursued, and we have a quote which needs clarification (e.g. "what is a worker" - do we have 16, 32 or 28 of these?).
      Ansys: there was a meeting while I was away, and I haven't had any feedback yet.

      James Davenport"

    2. Unknown User (ee0mdc)

      There are plans to get a trial licence for aquila, and purchase one if the trial shows significant performance.