HPC represents the cornerstone of scientific and technical computing. HPC applications typically run on the most technologically advanced and performant hardware, often custom-made. EVOLVE aims to build a converged infrastructure that brings together the HPC, Cloud, and Big Data worlds. The hardware platform used in EVOLVE has HPC characteristics, but uses virtualization to enable sharing the infrastructure across many different application types.
In HPC, applications – most commonly written using MPI – run concurrently on a large set of compute servers, networked together into a cluster. The Message-Passing Interface (MPI) is considered the de facto standard for distributed-memory, or shared-nothing, programming. MPI is not a programming language but a set of functions (i.e., an API) that can be used from a variety of languages.
MPI provides a low level of abstraction for developing high-performance parallel applications. It offers basic send/receive communication primitives, as well as higher-level synchronization constructs that help programmers deal with issues such as data distribution and barriers (distributed synchronization points). MPI is designed for exploiting data parallelism: all of an application's processes execute the same code in parallel, each on different data elements.
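To illustrate the send/receive and barrier primitives mentioned above, the sketch below uses the mpi4py Python bindings (an illustrative choice – the EVOLVE images ship OpenMPI for compiled languages as well). It requires mpi4py and an MPI runtime, e.g., `mpirun -np 2 python example.py`:

```python
# Minimal MPI send/receive sketch (mpi4py). Illustrative only; the payload
# and tags are arbitrary examples.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 sends a data chunk to rank 1.
    comm.send({"chunk": [1, 2, 3]}, dest=1, tag=0)
elif rank == 1:
    # Rank 1 receives its share of the data.
    data = comm.recv(source=0, tag=0)
    print(f"rank 1 got {data}")

comm.Barrier()  # distributed synchronization point: all ranks wait here
```

Each rank runs the same program, and branching on the rank number is how the same code operates on different data – the data-parallel pattern described above.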
In EVOLVE we integrate MPI into the Kubernetes ecosystem as a micro-service that can be executed as part of a larger workflow. The following paragraphs describe the steps we followed to achieve this integration.
Deployment of virtual clusters in Kubernetes
To support MPI applications in a container execution environment such as Kubernetes, we need a specialized image with the MPI-compatible library of choice; in our case, the OpenMPI library.
Our MPI Docker image includes:
- Mellanox OFED (host kernel independent)
- Mellanox HPC-X toolkit (includes OpenMPI with RDMA support)
- OpenSSH (essential for running MPI)
- Slurm (optional, to provide compatibility with existing Slurm scripts)
With the respective images prepared, deploying MPI-ready containers in Kubernetes is a matter of specifying the needed configuration in a YAML file and applying it. We refer to each such deployment as a “virtual cluster”. From the application's perspective, virtual clusters are indistinguishable from physical nodes that execute instances of the MPI processes in parallel. Each container instance in the virtual cluster has its own network address to be used by MPI, and we configure password-less SSH connectivity between instances to facilitate MPI deployment using a typical MPI hostfile. Multiple virtual clusters can co-exist on the same physical cluster of nodes.
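As an illustration, such a virtual cluster can be described with a headless Service plus a StatefulSet, so that each pod gets a stable network identity. The image name, replica count, and resource figures below are hypothetical placeholders, not the actual EVOLVE configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mpi-cluster            # headless Service: gives pods stable DNS names
spec:
  clusterIP: None
  selector:
    app: mpi-cluster
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mpi-cluster
spec:
  serviceName: mpi-cluster
  replicas: 4                  # virtual cluster node count
  selector:
    matchLabels:
      app: mpi-cluster
  template:
    metadata:
      labels:
        app: mpi-cluster
    spec:
      containers:
      - name: mpi
        image: example/mpi-hpcx:latest   # hypothetical image with OpenMPI + SSH
        resources:
          limits:
            cpu: "16"          # cores available to MPI ranks on this pod
```

Applying this manifest with `kubectl apply -f` brings up the set of identical, addressable MPI containers described above.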
In EVOLVE, the YAML for starting a set of MPI containers is provided as a Karvdash [1] template, where the user can specify the cluster node count (the number of pod replicas) and the total number of CPU cores that the MPI job is going to use. The user’s files are made available on each virtual cluster node through Karvdash under a common directory.
For HPC workloads, we exploit the availability of RDMA-enabled InfiniBand adapters on the EVOLVE platform. This helps avoid interference to HPC applications from Kubernetes management operations and other non-HPC workloads. No special Kubernetes configuration is necessary for the InfiniBand adapters. The automatically-generated MPI hostfile includes the pod IP addresses, which the MPI processes use to establish RDMA communication.
After the virtual MPI cluster is deployed, the user can compile and run applications with the Kubernetes kubectl command or Slurm’s srun. These commands can be executed from a Zeppelin notebook or embedded into larger container-based workflows using Argo.
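For example, with kubectl a job might be compiled and launched as follows. The pod, directory, and binary names are hypothetical; if Slurm is enabled in the image, srun can be used analogously:

```shell
# Compile the application inside the first virtual cluster node.
kubectl exec mpi-cluster-0 -- mpicc -O2 -o /shared/app /shared/app.c

# Launch 64 ranks across the virtual cluster, using the generated hostfile.
kubectl exec mpi-cluster-0 -- mpirun -np 64 --hostfile /shared/hostfile /shared/app
```

Because the user's files live under a common directory on every node, the compiled binary and hostfile are visible to all ranks without extra copying.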
Resource allocation for HPC workloads
Additionally, in EVOLVE, strict pod placement of MPI workloads is handled by a custom scheduler and its associated “MPI” policy. The scheduler can also manage MPI workloads alongside “Datacenter” services, such as databases, key-value stores, web servers, etc.
At this time the scheduler supports two different placement policies for MPI workloads:
- Dedicated: This policy places the MPI containers on dedicated nodes in order to avoid interference between MPI jobs and Datacenter services.
- Collocated: This policy allows the collocation of MPI jobs and Datacenter services on the same nodes.
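The node-filtering step behind the two policies can be sketched as follows. The `Node` structure, names, and capacities are illustrative assumptions, not the actual EVOLVE scheduler code:

```python
# Sketch of the Dedicated vs. Collocated placement policies for MPI pods.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus_free: int
    runs_datacenter: bool = False  # does this node host Datacenter services?

def eligible_nodes(nodes, policy):
    """Return nodes where an MPI pod may be placed under the given policy."""
    if policy == "dedicated":
        # Only interference-free nodes: no Datacenter services present.
        return [n for n in nodes if not n.runs_datacenter]
    if policy == "collocated":
        # Any node is a candidate; MPI jobs may share nodes with services.
        return list(nodes)
    raise ValueError(f"unknown policy: {policy}")

nodes = [
    Node("node-1", cpus_free=32),
    Node("node-2", cpus_free=16, runs_datacenter=True),
    Node("node-3", cpus_free=32),
]

print([n.name for n in eligible_nodes(nodes, "dedicated")])   # ['node-1', 'node-3']
print([n.name for n in eligible_nodes(nodes, "collocated")])  # all three nodes
```

Under either policy, the MPI pod's CPU request is then carved out of the chosen nodes statically, as described next.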
In both of the above policies the scheduler statically allocates resources for MPI jobs. For Datacenter-type services, on the other hand, the scheduler supports dynamic resource allocation based on a performance target metric. The resource allocation module expects the user to specify a performance target and monitors each running service, increasing its resources when observed performance falls below the target and decreasing them when it rises above it. If a node runs out of resources, the module migrates the pod to another node.
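One iteration of this control loop for a single service can be sketched as below. The function name, step size, and bounds are illustrative assumptions about the mechanism, not the module's actual implementation:

```python
# Sketch of the dynamic allocation rule for a Datacenter service:
# grow when below the performance target, shrink when above it.
def adjust_cpus(current_cpus, observed_perf, target_perf,
                step=1, min_cpus=1, max_cpus=64):
    """Return the CPU allocation after one monitoring interval."""
    if observed_perf < target_perf:
        # Under target: add resources, up to the node's capacity.
        return min(current_cpus + step, max_cpus)
    if observed_perf > target_perf:
        # Over target: reclaim resources, keeping the service alive.
        return max(current_cpus - step, min_cpus)
    return current_cpus  # on target: leave the allocation unchanged

print(adjust_cpus(4, observed_perf=800, target_perf=1000))   # 5
print(adjust_cpus(4, observed_perf=1200, target_perf=1000))  # 3
```

When `max_cpus` on the current node is exhausted and the service is still below target, the module falls back to migrating the pod to a node with spare capacity, as described above.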
Experiments with MPI benchmarks from the NAS Parallel Benchmarks suite [2] reproduce recent research reporting on the benefits of running MPI workloads on Kubernetes clusters [3]. The difference between running on bare metal and running in containers is negligible, suggesting that the virtualized execution environment based on Kubernetes and Docker is more than capable of hosting HPC workloads efficiently. We have also verified that spreading the workload across more containers behaves similarly to scaling a physical cluster with more nodes, with no "hidden" performance penalties inherent to the virtualized environment.
To evaluate the performance benefits of our custom scheduler, we ran a series of scenarios including both Datacenter and HPC services on the same cluster, comparing the two policies implemented. We observed a 30% performance drop for the MPI tasks under the Collocated policy compared to the interference-free Dedicated policy, and a 20% difference in Datacenter-side performance between the two policies. As expected, MPI applications run better when deployed without interference; we now have the option to provide such a policy even in a system that does not support it by default.
Note that a "virtual cluster" of any given size takes seconds to initialize and tear down, meaning that HPC workloads can easily be adjusted to the desired scale (container/node spread) so as to coexist with other types of applications.
1. Karvdash, the EVOLVE dashboard: https://github.com/CARV-ICS-FORTH/karvdash
2. NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
3. A. M. Beltre, P. Saha, M. Govindaraju, A. Younge and R. E. Grant, "Enabling HPC Workloads on Cloud Infrastructure Using Kubernetes Container Orchestration Mechanisms," 2019 IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC), Denver, CO, USA, 2019, pp. 11-20.