Get the right data to the right node at the right time.
The cost of data movement, in both runtime and energy, is predicted to be a major showstopper on the road to exascale. As the computers driving data centres, supercomputers and machine-learning farms become faster, their interconnects, i.e. their communication devices, become a limiting factor; even worse, they also face growing unreliability. One way to improve them is to make them smart: to let them learn how to route data flows, how to meet security constraints, or even how to deploy computations into the network. Smart network devices can take ownership of data movement, bring data into the right format before it is delivered, take care of security and resiliency, and so forth.
The Durham Intelligent NIC Environment (DINE) supercomputer is a small cluster equipped with Mellanox Smart NICs. These Smart NICs will enable direct access to remote memory to improve the performance of massively parallel codes, in preparation for future exascale systems, and will provide researchers with a test-bed facility for the development of new and novel computing paradigms. A Mellanox smart switch will also allow investigations into in-network computing to be carried out, which, when combined with the Smart NICs, will lead to a novel HPC computing environment.
As an experimental device, the system has been purchased by Durham's Department of Computer Science as part of a strategic research equipment purchase. It was installed in collaboration with DiRAC/COSMA. The PIs behind this project are Alastair Basden and Tobias Weinzierl.
"The DINE supercomputer will allow researchers to probe novel technologies in preparation for running advanced codes on exascale machines, enabling a step change in model resolution in fields such as weather forecasting, climate change and cosmology, with a huge scientific benefit."
Alastair Basden, HPC Technical Manager (COSMA)
The Durham Intelligent NIC Environment (DINE) supercomputing facility is hosted alongside COSMA, and is used by Computer Science researchers, DiRAC researchers and international collaborators.
A key feature of DINE is the Mellanox BlueField smart NIC cards which provide a programmable network offload capability, allowing network functions to be accelerated, and freeing up compute cores for other tasks.
DINE comprises 16 nodes, each containing:
- Dual 16-core AMD EPYC 7302 (Rome) processors (3 GHz)
- 128 GB RAM
- BlueField-1 Smart NIC (25 Gbit/s)
The nodes are interconnected via a Mellanox Spectrum Ethernet switch.
Access is available free of charge to Durham scientists. High priority will be given to fundamental research (no production runs).
Students will also be given access and hence benefit from working with cutting-edge technologies. This will help them to design algorithms and investigate ideas which will be carried forward into future UK and international facilities.
We are willing to give collaborators and external scientists access to the system as well, allowing them to prototype novel algorithms and write new software using smart network devices.
To get access, please follow these instructions to apply for an account, signing up to project hpcicc, and then send a message to firstname.lastname@example.org mentioning your interest in BlueField.
ExaClaw - Clawpack-enabled ExaHyPE for heterogeneous hardware
Durham project funded by EPSRC under the ExCALIBUR programme.
ExaHyPE - an Exascale Hyperbolic PDE Engine
EU H2020 FET HPC project with partners from Munich (Technische Universität München and Ludwig-Maximilians-Universität), Trento and Frankfurt.
"In Peano and ExaHyPE, we have been suffering from a lack of MPI progress and, hence, algorithmic latency for quite a while, and we have invested significant compute effort to decide how to place our tasks on the system. We hope that BlueField will help us to realise these two things far more efficiently. Indeed, we have started to write software that does this for us on the BlueField in a black-box way."
Tobias Weinzierl, Project PI
- Durham's Master in Scientific Computing and Data Analysis hosts several modules discussing aspects of novel HPC.
- The Department of Computer Science formally sponsored the purchase of this facility.
- Durham's Student Cluster Competition team.
- The teaMPI software is one of the first tools tailored towards SmartNICs.
- Let us know if you want to be added.
- Philipp Samfass et al. write on Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement. This task offloading mechanism is something we port to SmartNICs.
- Dominic E. Charrier et al. write on Enclave Tasking for Discontinuous Galerkin Methods on Dynamically Adaptive Meshes, a technique that yields many tiny tasks. Implementation challenges (incl. MPI progression) are sketched and the need for smart network devices is highlighted.
- Follow Philipp Samfass et al. at ISC 2020 (the online presentation will become available later), where we present our work on teaMPI: Replication-based Resilience without the (Performance) Pain.
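The core idea behind wait-time-driven task offloading can be illustrated with a minimal, self-contained sketch. Everything below is a toy simulation written for this page, not code from teaMPI or the SmartNIC port: the `Rank` class, the `offload` function and the `threshold` parameter are all hypothetical names, and real implementations measure wait times inside MPI calls and migrate tasks over the network rather than between Python objects. The sketch only shows the decision logic: ranks that spend long stretches waiting in MPI advertise themselves as victims, and overloaded ranks donate tasks to them until the load is roughly balanced.

```python
from dataclasses import dataclass, field

@dataclass
class Rank:
    """Toy stand-in for an MPI rank (hypothetical, for illustration only)."""
    rank_id: int
    tasks: list = field(default_factory=list)  # pending tasks on this rank
    wait_time: float = 0.0                     # measured time spent waiting in MPI

def offload(ranks, threshold=1.0):
    """Move tasks from busy ranks to ranks whose MPI wait time exceeds `threshold`.

    This mirrors the reactive idea of the Samfass et al. paper: wait times are
    used as a cheap signal for underload, so no global load model is needed.
    """
    idle = [r for r in ranks if r.wait_time > threshold]
    busy = sorted(ranks, key=lambda r: len(r.tasks), reverse=True)
    for receiver in idle:
        donor = busy[0]  # most-loaded rank donates
        while len(donor.tasks) - len(receiver.tasks) > 1:
            receiver.tasks.append(donor.tasks.pop())
        busy.sort(key=lambda r: len(r.tasks), reverse=True)
    return ranks

if __name__ == "__main__":
    # Rank 0 is overloaded; rank 1 reports long MPI waits and receives work.
    ranks = offload([Rank(0, tasks=list(range(8))), Rank(1, wait_time=2.0)])
    print([len(r.tasks) for r in ranks])  # both ranks end up with 4 tasks
```

On DINE, the interesting twist is that such victim selection and task migration can run on the BlueField cards themselves, so the host cores never spend cycles on the bookkeeping.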
This work has used Durham University's DINE cluster. DINE has been purchased through Durham University's Research Capital Equipment Fund 19_20 Allocation, led by the Department of Computer Science. It is installed in collaboration with, and as an addendum to, the DiRAC@Durham facility, managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). DiRAC equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure.