- How do I get help?
- How do I get access to the cluster?
- Can I access my BUCS home directories on the cluster?
- How do I cite the cluster?
- What software (numerical libraries, compilers, etc.) is available for use?
- How much is this going to cost?
- How do I run my program on the cluster?
- Why isn't my job just running?
- Error: "p4_error: semget failed for setnum: 0"
- Error: "mpiexec_node100: cannot connect to local mpd (/tmp/mpd2.console_username);"
- Can I use OpenMP on the cluster?
- Is there a way to get status information on the nodes in the cluster?
- Is there local storage available on each node?
- Unable to copy job's output and error log files
How do I get help?
The primary way to get help is by email.
There are two ways to get help by email.
Ask another user
You can sign up to the HPC discussion list (firstname.lastname@example.org) for users of the HPC facilities here at Bath:
Ask the BUCS HPC Support Team
How do I get access to the cluster?
Each department has a nominated member of staff who can grant access to the cluster. The list of these staff can be found on the Aquila wiki page.
Can I access my BUCS home directories on the cluster?
For practical reasons your home directory on the cluster is not the same as your main BUCS home directory. However, on the master node your BUCS home directory can be found in the same location as on any of the BUCS Unix machines. This location is stored in the shell variable '$BUCSHOME'.
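As a minimal sketch of how this variable might be used, the function below copies a file from the BUCS home directory into the cluster home directory. The function name 'copy_from_bucs' and the file name 'input.dat' are hypothetical, and the sketch assumes it is run on the master node, where '$BUCSHOME' is set.

```shell
# Copy one file from the BUCS home directory into the cluster home directory.
# $1 = path of the file relative to $BUCSHOME.
copy_from_bucs() {
    # Refuse to run if BUCSHOME is unset or empty (e.g. not on the master node).
    if [ -z "${BUCSHOME:-}" ]; then
        echo "BUCSHOME is not set; run this on the master node" >&2
        return 1
    fi
    cp "$BUCSHOME/$1" "$HOME/"
}

# Usage on the master node (hypothetical file name):
#   copy_from_bucs input.dat
```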
How do I cite the cluster?
We weren't sure about this, so we asked Prof James Davenport the question "How should I cite my/our use of Aquila in a published paper?"
He suggests the following:
"These computations were performed on the University of Bath's High Performance Computing Facility."
If the time was funded by an external agency, one might add ", time on which was funded by the Pragmatric Philosophy and Astrology Research Council under grant GR/C/666".
He hopes soon to have (and will add here) a short page describing the machine suitable for inclusion (as a URL). If one wants a description in the meantime, his suggestion is this: "An 800-core Intel Xeon E5430 2.66GHz machine at 2GB/core of 667MHz memory, with DDR Infiniband interconnect".
What software (numerical libraries, compilers, etc.) is available for use?
- Intel Compiler Suite; versions 10.1.015 and 11.1/046
- GNU Compiler Collection; versions 4.1.2, 4.3.4 and 4.4.4.
- Open64 compiler; version 184.108.40.206
The lists of installed software are always expanding, so it is best to check on the machine itself. The command module avail will give you a list of what is currently available.
- LAPACK 3.2.1
- ScaLAPACK 1.8.0
- BLAS 1
- ACML 4.3.0
- Intel MKL 10.2.1.017
- ATLAS 3.9.23
- GotoBLAS 1.25
- GotoBLAS2 1.13
- BLACS 1.1 patch03
- bonnie++ 1.96
- FFTW2 2.1.5
- globalarrays 4.2
- hdf5 1.6.9
- NAG Fortran compiler
- NAG F77 library
- NAG F90 library
- NAG parallel library
- NAG SMP library
- MATLAB Compiler Runtime
- Ansys CFX
- Espresso 4.3
- PAUP 4b10
- openfoam 1.6
- netcdf 4.0.1
- mrbayes 3.1.2
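The module avail command mentioned above is usually paired with module load and module list. The sketch below wraps those two in a small helper; the function name 'load_modules' and the module names in the usage comment are hypothetical, so copy the exact names that module avail prints on the cluster.

```shell
# Load each named module in turn, stopping at the first failure,
# then list what is loaded in the current shell.
load_modules() {
    for name in "$@"; do
        module load "$name" || return 1
    done
    module list
}

# Usage on the cluster (module names are hypothetical):
#   module avail
#   load_modules intel/11.1 fftw2/2.1.5
```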
How much is this going to cost?
Costing is straightforward - pFact now knows about it (MRF-Computing). Users are expected to apply for funding on this basis.
How do I run my program on the cluster?
Access to the nodes on the cluster is controlled by a queuing system. To run your code on the cluster you need to write a job script which tells the queuing system what resources you wish to use and how to run your program.
More information about this queuing system can be found in the HPC Queuing section of this wiki.
In its simplest form:
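A minimal sketch of such a job script, assuming Torque-style '#PBS' directives, might look like the following. The job name, the resource values and the single echo line are illustrative assumptions rather than site defaults; the HPC Queuing section of the wiki remains the authoritative reference.

```shell
#!/bin/bash
# Minimal job script sketch for a Torque/PBS-style queuing system.
# Job name and resource requests below are illustrative assumptions.
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory when run outside the queue.
cd "${PBS_O_WORKDIR:-.}"

# Replace the next two lines with the command that runs your program.
message="Hello from $(uname -n)"
echo "$message"
```

Saved as, say, hello.sh (a hypothetical file name), the script would be submitted with 'qsub hello.sh'.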
Why isn't my job just running?
If you think that an idle job should have started, you can check why the scheduler hasn't started it with the checkjob command. Often it is waiting while the scheduler tries to free enough nodes to let a higher priority parallel job run. If your job is not at the top of the queue and won't start even though there seem to be free nodes, this is almost always the reason.
This behaviour is caused by reservation and backfill, a system that allows mixed serial and parallel jobs to run while also keeping utilisation reasonably high. Our scheduler takes the queued job with the highest priority and works out the earliest time it could run, assuming all running jobs use their full walltime allocation (it is not intelligent enough to do otherwise). It then takes the calculated start time for the queued job and the set of nodes it will run on and reserves those nodes. Nothing will be allowed to start that could delay the start time of the top job. This means that if a big parallel job is waiting and the system has some free nodes, but not enough to let it run, then no small jobs will be allowed to start on the free nodes unless their walltime limit is small enough that they will finish before the top job's calculated start time. Otherwise the top job could get delayed by a stream of smaller jobs.
If you want to run a small, short job in this situation, use the showbf command to find the largest walltime limit you can request without delaying the waiting large job. If you set a limit just under that time then your job will get backfilled onto the free nodes.
The scheduler recalculates the top job and the reservations regularly, so it's possible for things to change if jobs finish earlier than expected or as fairshare changes the priority of the queued jobs.
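The two commands discussed in this answer can be combined as in the sketch below. The function name 'why_waiting', the job ID, the script name and the walltime value are all hypothetical; the walltime would be chosen from whatever window showbf actually reports.

```shell
# Ask the scheduler why a queued job has not started, then show the
# resources currently available for backfill.
# $1 = job ID as reported when the job was submitted.
why_waiting() {
    checkjob "$1"   # detailed state of the given job
    showbf          # nodes and time window available right now
}

# Usage on the cluster (hypothetical job ID, script name and walltime):
#   why_waiting 12345
#   qsub -l walltime=00:20:00 short_job.sh
```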
Error: "p4_error: semget failed for setnum: 0"
This is usually caused by the untimely demise of a previous job you ran on the same nodes. As it stands you will be unable to fix this yourself. Please contact email@example.com and we will sort it out for you.
Error: "mpiexec_node100: cannot connect to local mpd (/tmp/mpd2.console_username);"
If you see this error, it means that you did not properly start the 'mpd' daemon before using 'mpirun' with either MPICH2 or MVAPICH2. The 'mpd' daemon must first be started using the 'mpdboot' command, and shut down at the end with 'mpdallexit', as in the example job scripts on the MPICH2 and MVAPICH2 pages.
- There is a separate MPI FAQ which you may also like to look at.
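The start-up and shut-down order described above can be sketched as a helper for use inside a job script. This is not a copy of the example scripts on the MPICH2 and MVAPICH2 pages; the function name 'run_with_mpd' is hypothetical, and the sketch assumes the scheduler provides the allocated hosts in the file named by '$PBS_NODEFILE', one line per core.

```shell
# Start the mpd ring, run an MPI program, and shut the ring down afterwards.
# $1 = number of MPI processes, remaining arguments = program and its arguments.
run_with_mpd() {
    mpd_nprocs=$1
    shift
    mpd_hosts=$(mktemp "${TMPDIR:-/tmp}/mpd.hosts.XXXXXX") || return 1
    # PBS_NODEFILE repeats a host once per core; mpdboot wants each host once.
    sort -u "$PBS_NODEFILE" > "$mpd_hosts"
    mpd_nhosts=$(( $(wc -l < "$mpd_hosts") ))
    # Start one mpd daemon per host; give up if the ring cannot be started.
    mpdboot -n "$mpd_nhosts" -f "$mpd_hosts" || { rm -f "$mpd_hosts"; return 1; }
    mpirun -np "$mpd_nprocs" "$@"
    mpd_rc=$?
    # Shut the ring down even if the program failed.
    mpdallexit
    rm -f "$mpd_hosts"
    return "$mpd_rc"
}

# Usage inside a job script (hypothetical program name):
#   run_with_mpd 16 ./my_mpi_program
```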
Can I use OpenMP on the cluster?
Yes, the appropriate libraries have been installed on the cluster for you to use OpenMP. However, please note that by default an OpenMP "program" will only use one node on the cluster. This means that if you want to run multiple programs, each one should be submitted as a separate job with its own job script.
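A job script for a single OpenMP program might be sketched as follows, assuming Torque-style '#PBS' directives. The request for 8 cores on one node and the walltime are assumptions about the node size and the job, and the program name in the final comment is hypothetical.

```shell
#!/bin/bash
# OpenMP job script sketch: one program, one node.
# ppn=8 and the walltime are illustrative assumptions.
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00

# Run from the submission directory, or the current one outside the queue.
cd "${PBS_O_WORKDIR:-.}"

# Use one thread per allocated core. Torque lists one line per core in
# PBS_NODEFILE; fall back to a single thread when the file is unavailable.
if [ -r "${PBS_NODEFILE:-}" ]; then
    OMP_NUM_THREADS=$(( $(wc -l < "$PBS_NODEFILE") ))
else
    OMP_NUM_THREADS=1
fi
export OMP_NUM_THREADS
echo "Running with $OMP_NUM_THREADS OpenMP threads"

# Start the program here, for example:
#   ./my_openmp_program
```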
Is there a way to get status information on the nodes in the cluster?
The Ganglia web interface should give you most of the information you need.
Is there local storage available on each node?
You are permitted to write to /local on each of the nodes. However, it is recommended that most checkpointing is done into your home directory, because you are not guaranteed to get the same nodes each time and /local will be wiped at the end of your run.
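One way to use /local safely is to work in a private scratch directory and copy everything back before the job ends. The sketch below assumes that pattern; the function name 'run_in_scratch' and the paths in the usage comment are hypothetical.

```shell
# Run a command inside a private scratch directory and copy its files back.
# $1 = scratch root (e.g. /local), $2 = destination directory,
# remaining arguments = command to run inside the scratch directory.
run_in_scratch() {
    scratch_dir=$(mktemp -d "$1/job.XXXXXX") || return 1
    scratch_dest=$2
    shift 2
    mkdir -p "$scratch_dest"
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$scratch_dir" && "$@" )
    scratch_rc=$?
    # Copy results back before the scratch area is wiped at the end of the run.
    cp -r "$scratch_dir"/. "$scratch_dest"/
    rm -rf "$scratch_dir"
    return "$scratch_rc"
}

# Usage inside a job script (hypothetical paths):
#   run_in_scratch /local "$HOME/results" "$HOME/bin/my_program"
```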
Unable to copy job's output and error log files
Should you receive an email from the scheduler mentioning that it was unable to copy back the stageout files, e.g.
In this particular instance Torque was unable to handle the special characters "(" and ")", so please avoid using special characters when naming your files and directories. Unix special characters are: ! @ $ ^ & * ~ ? . | / [ ] < > \ ` " ; # ( )
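A name can be checked before it is used in a job script. The sketch below is deliberately conservative: it treats only letters, digits, underscore, hyphen and dot as safe within a single file or directory name. Both that choice and the function name 'is_safe_name' are assumptions for illustration, not a rule taken from Torque.

```shell
# Succeed only if the name is non-empty and contains nothing but
# letters, digits, dot, underscore and hyphen.
is_safe_name() {
    case $1 in
        '' | *[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage (hypothetical file names):
#   is_safe_name "run_01.out"  && echo "ok"
#   is_safe_name "run(1).out"  || echo "rename this file"
```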