Motivation
Currently, tiCrypt only supports Secure Virtual Machines as compute nodes for Local Slurm. While this provides strong security guarantees, it may not be the best fit for all workloads since it adds virtualization overhead in the form of:
- High startup cost: it takes 20-30 seconds to create and provision a VM as a Slurm node for Local Slurm.
- Virtualization cost: virtual machines add a 5-10% overhead to most computational tasks.
- Degraded high-performance networking: virtualization reduces network throughput and increases latency, which is particularly damaging for MPI jobs.
To address these issues, we are planning to add support for bare-metal nodes in Local Slurm. This article explores the design and implementation of this new feature.
Existing Architecture
The Global Slurm manages overall cluster resources and ensures fairness among projects and users. The Local Slurm securely executes jobs inside secure enclaves. Users submit jobs to the Local Slurm, which transforms them into global jobs while keeping the details of execution (code and data) secret from the Global Slurm. The execution steps are as follows:
- The user submits a job to the Local Slurm, specifying the resources needed and the code to execute.
- The Local Slurm job is intercepted (using a Lua plugin) by ticrypt-vm-controller and transformed into a request to the tiCrypt Backend for a global job.
- The tiCrypt Backend creates a Global Slurm job (marked with the account of the user and the project).
- The Global Slurm schedules the job on a compute node and starts it.
- ticrypt-host-controller, the tiCrypt agent running on the compute nodes, receives the job (via the job submission script) and starts a secure VM for the job. The tiCrypt Backend is notified of the VM creation to ensure it is handed over to the ticrypt-vm-controller for provisioning.
- The ticrypt-vm-controller starts slurmd on the VM, incorporating it into the Local Slurm cluster. The job then executes inside the secure VM under slurmd.
- Once the job finishes, ticrypt-vm-controller is notified by the Lua plugin and destroys the VM, removing it from the Local Slurm cluster.
- The Global Slurm job finishes, and the resources are released.
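The steps above form a strictly ordered lifecycle, which can be modeled as a small state machine. This is only an illustrative sketch: the step names and the JobLifecycle class are hypothetical and are not part of tiCrypt.

```python
from enum import Enum, auto


class Step(Enum):
    """Stages of a job, in the order listed above (names are hypothetical)."""
    LOCAL_SUBMITTED = auto()     # user submits to the Local Slurm
    GLOBAL_REQUESTED = auto()    # Lua plugin -> ticrypt-vm-controller -> tiCrypt Backend
    GLOBAL_JOB_CREATED = auto()  # Backend creates the Global Slurm job
    GLOBAL_JOB_STARTED = auto()  # Global Slurm schedules and starts the job
    VM_STARTED = auto()          # ticrypt-host-controller starts the secure VM
    SLURMD_JOINED = auto()       # slurmd joins the Local Slurm cluster; job runs
    VM_DESTROYED = auto()        # job finished; VM removed from the Local Slurm
    RESOURCES_RELEASED = auto()  # Global Slurm job finishes


ORDER = list(Step)  # Enum iteration preserves definition order


class JobLifecycle:
    """Tracks one job and rejects out-of-order transitions."""

    def __init__(self):
        self.history = [Step.LOCAL_SUBMITTED]

    @property
    def current(self):
        return self.history[-1]

    def advance(self, step):
        next_index = ORDER.index(self.current) + 1
        if next_index >= len(ORDER) or ORDER[next_index] is not step:
            raise ValueError(f"cannot go from {self.current.name} to {step.name}")
        self.history.append(step)


job = JobLifecycle()
for step in ORDER[1:]:
    job.advance(step)
print(job.current.name)  # RESOURCES_RELEASED
```

Modeling the flow this way makes it explicit that a VM is never started before the Global Slurm has scheduled the job, and that resources are released only after the VM is destroyed.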
The existing architecture is illustrated in the figure below:
Design of the New Feature
While it might be tempting to use the traditional Slurm execution model on bare-metal nodes, job security would be significantly compromised. The following issues would arise:
- Job isolation: without VMs, jobs would run directly on the host OS, which means that they would not be isolated from each other. This could lead to security issues, especially if one job is compromised and can access the data of another job.
- Access to data: the current execution model makes data available only inside the secure VMs. Specifically, the VM provisioning sets up a VPN (based on StrongSwan) and mounts the data from the controlling VM running the Local Slurm controller. Since all the communication, including the file accesses, goes through the VPN, the data is secured from the infrastructure.
- Protection from infrastructure and admins: without VMs, jobs would be exposed to the infrastructure and admins, which could lead to security issues, especially if the infrastructure is compromised.
Key Ideas for the New Feature
The following key ideas address these issues while preserving security:
- Use of containers: instead of running jobs directly on the host OS, we can use containers to provide isolation between jobs. This would allow us to run jobs on bare-metal nodes while still providing some level of isolation. Specifically, we want to use Apptainer/Singularity, which is a container platform designed for HPC environments.
- Use of the whole node: instead of sharing the node between multiple jobs, we can dedicate the whole node to a single job. This would provide better isolation and security, as well as better performance since there would be no contention for resources between jobs. This is a perfect model for MPI jobs, which require high-performance networking and low latency. It is, however, extremely wasteful for small jobs.
- Auto-provisioned security: to ensure that the jobs are still secure, automatic mechanisms need to be put in place to provision the security. No manual or admin control mechanism should be used, as such mechanisms could compromise security.
- tiCrypt integration: The ticrypt-vm-controller and Local Slurm cannot tell the difference between a VM and a bare-metal container. As long as the container integrates the networking, tiCrypt registration and provisioning steps, the rest of the architecture is compatible with the new feature.
- Networking for containers: to ensure integration with existing architecture, the br-secure network bridge used for the VMs should also be used for the containers. This would allow the containers to communicate with the Local Slurm controller and to access the data securely.
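The tiCrypt integration idea can be pictured as a shared interface: as long as a worker can attach to br-secure, register with tiCrypt, and start slurmd, the controller-side logic does not need to know which kind it is. The sketch below is a hypothetical illustration; the SecureWorker interface and its method names are assumptions, not the actual tiCrypt API.

```python
from typing import Protocol


class SecureWorker(Protocol):
    """What the rest of the stack relies on (hypothetical interface)."""

    def attach_network(self, bridge: str) -> None: ...
    def register(self) -> None: ...
    def start_slurmd(self) -> None: ...


class _RecordingWorker:
    """Records each call so the example can show what was done."""

    kind = "worker"

    def __init__(self):
        self.log = []

    def attach_network(self, bridge):
        self.log.append(f"{self.kind}: attach to {bridge}")

    def register(self):
        self.log.append(f"{self.kind}: register with tiCrypt")

    def start_slurmd(self):
        self.log.append(f"{self.kind}: start slurmd")


class VmWorker(_RecordingWorker):
    kind = "vm"


class ContainerWorker(_RecordingWorker):
    kind = "container"


def bring_up(worker: SecureWorker, bridge: str = "br-secure") -> None:
    """Controller-side steps; identical for both worker kinds."""
    worker.attach_network(bridge)
    worker.register()
    worker.start_slurmd()


for worker in (VmWorker(), ContainerWorker()):
    bring_up(worker)
    print(worker.log)
```

Both workers go through exactly the same three calls, which is the property that keeps the rest of the architecture unchanged.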
Overall Implementation Plan
The following components need to be enhanced to implement the new feature:
- Container image: a new container image needs to be created that includes slurmd, the ticrypt-vm-controller, StrongSwan and the software required to run Slurm jobs. This container image should be based on the existing VM images used for Local Slurm to ensure compatibility.
- Container provisioning: the ticrypt-host-controller needs to be enhanced to detect when the Slurm job requires --exclusive access to a node and to start a container instead of a VM. The container should be started so that it takes over the whole node.
- Networking: the container needs to be configured to use the br-secure network bridge to ensure secure communication with the Local Slurm controller and access to data.
- Container execution: the ticrypt-host-controller needs to:
  - Execute the container with capabilities for network configuration, NFS mounts, and "fake-root" to allow user creation.
  - Use overlays to allow changes mandated by user management and Slurm job execution.
  - Write overlay data to a temporary encrypted drive (in case sensitive data is written to the overlay) and delete it after the job finishes. Do not save the encryption key to ensure the drive cannot be recovered after deletion.
  - Stop the container once the Global Slurm job finishes, similar to how VMs are stopped in the existing architecture.
  - Enhance the local job-tracking database to track both VMs and containers.
- Information propagation: The --exclusive flag needs to be propagated from the Local Slurm job submission to the Global Slurm via ticrypt-vm-controller and the tiCrypt Backend.
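As a rough sketch of how the provisioning and execution items above might fit together, the code below detects an --exclusive request, picks a launcher, and builds (without running) the commands for a throwaway encrypted overlay and an Apptainer invocation. All function names, paths, and the GlobalJob type are hypothetical. The Apptainer and cryptsetup options shown are standard ones, but the exact combination needed here, including the capabilities required for NFS mounts, is an assumption that would have to be validated.

```python
from dataclasses import dataclass


def wants_whole_node(sbatch_args):
    """True when the submission asks for exclusive node access.

    Only the bare --exclusive flag is treated as a whole-node request;
    handling of variants such as --exclusive=user is left open.
    """
    return "--exclusive" in sbatch_args


@dataclass
class GlobalJob:
    job_id: int
    exclusive: bool  # assumed to be propagated from the Local Slurm job


def choose_launcher(job):
    """Pick the isolation mechanism for a Global Slurm job."""
    return "container" if job.exclusive else "vm"


def encrypted_overlay_commands(device, name, mount_point):
    """Setup and teardown commands for a throwaway encrypted volume.

    The key is read straight from /dev/urandom and never stored, so the
    contents cannot be recovered once the mapping is closed.
    """
    mapper = f"/dev/mapper/{name}"
    setup = [
        ["cryptsetup", "open", "--type", "plain",
         "--cipher", "aes-xts-plain64", "--key-size", "512",
         "--key-file", "/dev/urandom", device, name],
        ["mkfs.ext4", "-q", mapper],
        ["mount", mapper, mount_point],
    ]
    teardown = [
        ["umount", mount_point],
        ["cryptsetup", "close", name],
    ]
    return setup, teardown


def apptainer_argv(image, overlay_dir, command):
    """Build (but do not run) an Apptainer invocation for the job container."""
    return [
        "apptainer", "exec",
        "--fakeroot",                  # "fake-root", so users can be created
        "--net", "--network", "none",  # own netns; wiring to br-secure is separate
        "--overlay", overlay_dir,      # writable layer on the encrypted volume
        image,
        *command,
    ]


job = GlobalJob(job_id=42, exclusive=wants_whole_node(["--exclusive", "-N", "1"]))
print(choose_launcher(job))  # container
```

Returning argument lists instead of executing them keeps the sketch side-effect free and makes the intended commands easy to review before anything runs as a privileged user.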
Security Considerations
Using containers instead of VMs does reduce the security guarantees, but the design choices outlined above mitigate the risks:
- Blocking outgoing traffic: the container will use Open vSwitch based networking set up by the ticrypt-host-controller using exclusively the br-secure bridge. This network is already set up to prevent data exfiltration since it restricts masquerading to allowed IP+Port combinations only. This is equivalent to the Libvirt solution for VMs, in which Libvirt is told to use br-secure rather than set up its own bridge+dnsmasq+masquerading rules. This is, by far, the most critical security measure since it prevents data exfiltration even if the container is compromised or if the user tries to exfiltrate data.
- Whole-node containers: dedicating the whole node to a single job provides better isolation and security since there are no other jobs running at the same time. By scrubbing the node between jobs (mostly using the overlays on encrypted drives that get destroyed after the job finishes), we can ensure that no data is left on the node after the job finishes.
- IPsec (StrongSwan) inside the container: running the VPN endpoint inside the container should provide some level of protection against the host/infrastructure as well.
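The outgoing-traffic rule in the first item above amounts to an allowlist of destination IP+Port combinations. The following sketch models only the policy decision. The real enforcement happens in the host's masquerading rules, and the addresses and ports below are made-up placeholders.

```python
import ipaddress

# Hypothetical allowlist of (destination network, destination port) pairs.
ALLOWED = [
    (ipaddress.ip_network("10.0.0.5/32"), 443),
    (ipaddress.ip_network("10.0.1.0/24"), 2049),
]


def egress_allowed(dst_ip, dst_port, allowed=ALLOWED):
    """Return True only if (dst_ip, dst_port) matches an allowlist entry."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net and dst_port == port for net, port in allowed)


print(egress_allowed("10.0.1.7", 2049))  # True
print(egress_allowed("10.0.1.7", 22))    # False
print(egress_allowed("8.8.8.8", 443))    # False
```

The default is to deny: anything not explicitly listed is blocked, which is why a compromised container still cannot reach arbitrary destinations.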
Aspects for which mitigation is difficult include:
- Access to the NFS data share: as with VMs, the containers will mount the NFS data share from the controlling VM. This mount is visible to the host OS, so in principle admins can access it. There are ways to make it "invisible" to the host OS, but they are probably not a full mitigation.
The exposure to the system admin is much higher with containers than with VMs, but such a risk can be mitigated and this solution allows better performance. Since no user accounts are needed on the host, at least rootkit-style attacks are not possible.
Implementation Challenges
Most of the work is straightforward, but a few areas require extra care:
- Networking: Ports from the br-secure Open vSwitch bridge must be used to provision networking for the containers. The OVS-LINK tool should serve as inspiration for this.
- Capabilities: The container needs the right Linux capabilities to configure networking, mount NFS drives, and create users. These must be carefully scoped so the container runs as the ticrypt user while still having the permissions needed to execute jobs and access data. Running as root is a fallback but undesirable from a security standpoint.
- StrongSwan and hardware-assisted encryption: StrongSwan can offload encryption to network hardware, ensuring the VPN does not add a performance penalty. This will likely require careful implementation and experimentation.
- MPI integration: MPI is tricky with containers, especially when networking is virtualized. Careful configuration and testing are needed to ensure MPI jobs run efficiently.
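For the networking challenge, one generic way to give a container a port on an Open vSwitch bridge is a veth pair: one end is added to the bridge and the other is moved into the container's network namespace. The sketch below only builds the command lists. The interface naming scheme and the function are hypothetical, and this is not necessarily how OVS-LINK works.

```python
IFNAME_MAX = 15  # Linux interface names are limited to 15 characters


def ovs_port_commands(job_id, netns, bridge="br-secure"):
    """Commands to connect a container network namespace to an OVS bridge."""
    host_if = f"tc{job_id}h"  # end that stays on the host, attached to the bridge
    cont_if = f"tc{job_id}c"  # end that is moved into the container namespace
    if len(host_if) > IFNAME_MAX:
        raise ValueError(f"interface name too long: {host_if}")
    return [
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", cont_if],
        ["ip", "link", "set", cont_if, "netns", netns],
        ["ovs-vsctl", "add-port", bridge, host_if],
        ["ip", "link", "set", host_if, "up"],
    ]


for cmd in ovs_port_commands(42, "job42"):
    print(" ".join(cmd))
```

Addressing, the StrongSwan tunnel, and cleanup of the port when the job ends are deliberately left out. They would follow whatever the existing VM path already does on br-secure.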
Conclusions
Adding bare-metal support to Slurm+tiCrypt is made surprisingly straightforward by an existing architecture that is already flexible and built on standard components. The challenges are real but manageable with careful design. The new feature will deliver better performance for demanding workloads, especially MPI jobs, while preserving a strong security posture. We look forward to seeing how it performs in practice.