The Durham Intelligent NIC Environment (DINE) supercomputer is a small 24-node development cluster equipped with NVIDIA BlueField-2 Data Processing Units (DPUs) on a non-blocking HDR200 fabric. These DPUs enable direct access to remote memory to improve the performance of massively parallel codes, in preparation for future exascale systems, and provide researchers with a test-bed facility for the development of new and novel computing paradigms.
The cost of data movement - in both runtime and energy - is predicted to be a major showstopper on our road to exascale. As the computers driving data centres, supercomputers and machine-learning farms become faster, their interconnects, i.e. their communication devices, become a limiting factor; even worse, they also face growing unreliability. One way to improve them is to make them smart: to make them learn how to route data flows, how to meet security constraints, or even to deploy computations into the network. Smart network devices can take ownership of data movement, bring data into the right format before it is delivered, take care of security and resiliency, and so forth.
DINE has been funded by DiRAC, ExCALIBUR, the Department of Computer Science and the Institute for Computational Cosmology as part of a strategic research equipment purchase.
Please see DINE notes for information about using DINE.
The Durham Intelligent NIC Environment (DINE) supercomputing facility is hosted alongside COSMA, and is used by Computer Science researchers, DiRAC researchers and international collaborators.
A key feature of DINE is the NVIDIA BlueField smart NIC cards which provide a programmable network offload capability, allowing network functions to be accelerated, and freeing up compute cores for other tasks.
DINE comprises 24 nodes, each containing:
- Dual 16-core AMD EPYC 7302 Rome processors (3GHz)
- 512GB RAM
- A BlueField-2 Smart NIC (200Gbit/s HDR200), which itself contains 16GB RAM and 8 high-clock ARM cores, and runs Ubuntu 20.04
The nodes are connected by an NVIDIA HDR200 InfiniBand switch.
Students will also benefit from working with cutting-edge technologies, designing algorithms and investigating ideas which will be carried forward into future UK and international facilities.
Access is available free of charge to the UK research community. High priority will be given to developmental and fundamental Exascale research (no production runs).
We are willing to give collaborators and external scientists access to the system as well to allow them to prototype novel algorithms and write new software using smart network devices.
To get access, please follow these instructions to apply for an account, signing up to project do009, and then send a message to firstname.lastname@example.org mentioning your interest in BlueField.
COSMA8 login nodes, which also have AMD Rome processors, should be used for compiling code. Where native ARM access is required, please create a Slurm job to run on the bluefield1 partition, and then ssh directly to the local BlueField card.
The Slurm workload manager should then be used to submit jobs to the compute nodes, using the bluefield1 queue.
It should be noted that DINE has automatic power-saving features: unused nodes are powered off after 1 hour. When a Slurm job is submitted, these nodes are powered on if necessary, which can take a few minutes.
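As an illustration, a minimal batch script for the bluefield1 partition might look as follows. This is a sketch: the job name, node count, time limit and binary are placeholders, and your project account flags may differ.

```shell
#!/bin/bash
#SBATCH --partition=bluefield1   # the DINE queue
#SBATCH --nodes=2                # number of host nodes (placeholder)
#SBATCH --time=00:30:00
#SBATCH --job-name=bf-test       # placeholder job name

# Show which host nodes were allocated; from one of these you can
# ssh to its local BlueField card (e.g. from b101 to bluefield101).
echo "Allocated nodes: $SLURM_JOB_NODELIST"

mpirun ./my_app   # placeholder binary
```

Remember that if the allocated nodes were powered down, the job may wait a few minutes before starting.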
Hints and tips for usage
The Ethernet network (control, ssh, Slurm) has:
- Host nodes, b[101-124]: 172.17.178.[201-224]
- BlueField cards, bluefield[101-124]: 172.17.179.[201-224]
InfiniBand is reached via:
- bfh[101-124]: 172.18.178.[201-224] for the hosts
- bfd[101-124]: 172.18.179.[201-224] for the BlueField cards
The BlueField cards (devices) operate in "Host Separated" mode, meaning that they can be treated as servers in their own right (running Ubuntu), and MPI jobs can run on both host and device.
Currently, manual mpirun calls are necessary to specify the hosts and devices to use.
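As a sketch, assuming Open MPI's colon-separated MPMD syntax and placeholder binaries compiled separately for the x86-64 hosts and the ARM BlueField cards, such a call might look like this:

```shell
# Launch one rank on each of two host nodes and one rank on each of
# their BlueField cards, addressed via the InfiniBand host names.
# my_app_x86 and my_app_arm are hypothetical binaries built for the
# respective architectures.
mpirun -np 2 -host bfh101,bfh102 ./my_app_x86 : \
       -np 2 -host bfd101,bfd102 ./my_app_arm
```

Because the cards run in "Host Separated" mode, the host and device ranks join the same MPI job even though they run on different architectures.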
If you have any hints that you would like to appear here, please let us know!
ExaClaw - Clawpack-enabled ExaHyPE for heterogeneous hardware
Durham project funded by EPSRC under the ExCALIBUR programme.
ExaHyPE - an Exascale Hyperbolic PDE Engine
EU H2020 FET HPC project with partners from Munich (Technische Universität München and Ludwig-Maximilians-Universität), Trento and Frankfurt.
In Peano and ExaHyPE, we have been suffering from a lack of MPI progress and, hence, algorithmic latency for quite a while, and we have invested significant compute effort to decide how to place our tasks on the system. We hope that BlueField will help us to realise these two things far more efficiently. In fact, we have started to write software that does this for us on the BlueField in a black-box way.
Tobias Weinzierl, Project PI
- Durham's Master in Scientific Computing and Data Analysis hosts several modules discussing aspects of novel HPC.
- The Department of Computer Science formally sponsors and purchased this facility.
- Durham's Student Cluster Competition team.
- The teaMPI software is one of the first tools tailored towards SmartNICs.
- Let us know if you want to be added.
- Philipp Samfass et al. write on Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement. This task offloading mechanism is something we port to SmartNICs.
- Dominic E. Charrier et al. write on Enclave Tasking for Discontinuous Galerkin Methods on Dynamically Adaptive Meshes, a technique that yields many tiny tasks. Implementation challenges (including MPI progression) are sketched and the need for smart network devices is highlighted.
- Follow Philipp Samfass et al. at ISC 2020 (the online presentation will become available later), where we present our work on teaMPI: Replication-based Resilience without the (Performance) Pain.
This work has used Durham University's DINE cluster. DINE has been purchased through Durham University's Research Capital Equipment Fund 19_20 Allocation, led by the Department of Computer Science. It is installed in collaboration with, and as an addendum to, the DiRAC@Durham facility managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). DiRAC equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure.