Kueue integration (MasterQueue)

The bidirectional messaging layer provided by NATS enables feedback on the status of the resource pools and on the resources they can provide. To enable the definition of resource-usage policies at cluster level (rather than at node level), the InterLink resource pools are mapped to Kueue ResourceFlavors and organized into ClusterQueues.

A custom controller, named kueue-nats, connects to the NATS server and subscribes to the subjects under which resource providers publish the available resources. When it receives resource updates, the kueue-nats controller creates or updates the ResourceFlavors and the ClusterQueues accordingly. The configuration of the kueue-nats controller relies on a Kubernetes Custom Resource (CR) named MasterQueue.
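As a quick sanity check, the updates published by the providers can be observed by subscribing to the same subject with the NATS CLI; the connection string and subject below are the ones used in the example MasterQueue further down and should be adapted to the actual deployment:

# Watch the resource updates published by the providers (connection string and subject are illustrative)
nats sub -s "nats://user:password@nats.nats:4222" "interlink.resources"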

A MasterQueue is the cluster-level object collecting all the resource pools made available to the cluster, either via InterLink or as local resources. The MasterQueue can then lend resources to groups of users or client applications using the Kueue resource-borrowing mechanism. All the ClusterQueues spawned by a MasterQueue belong to the same cohort, named after the MasterQueue itself.

Organizing Flavors

As mentioned, ResourceFlavors map resource pools. Since multiple resource pools might be equivalent and interchangeable for the applications running in the cluster (think, for example, of two Kubernetes clusters running at different sites), kueue-nats introduces groups of ResourceFlavors, mapping groups of equivalent resource pools.

Groups of resource flavors fall into two categories: natsFlavor, managed by InterLink via the InterLink NATS plugin, and localFlavor, managed directly by the cluster administrator via Kueue and made available to the MasterQueue.

natsFlavors

A natsFlavor should indicate the following properties:

* natsConnector: a string defining the connection to the NATS server. If the NATS plugin is deployed with interlink-in-one, this string is provided by Helm itself when installing or updating the chart.
* natsSubject: a string representing the NATS subject used by the resource pools to publish updates on the available resources.
* virtualNode: the name of the virtual node to which the pods to be submitted with InterLink are addressed.
* poolTimeout: an integer defining a timeout in seconds. If no update on the available resources is received within this time interval, the corresponding resource flavor is dropped.

The names of the pools belonging to the group can be listed explicitly using the pools keyword,

pools:
  - podman-firenze
  - slurm-cineca

or the group can be defined with a regular expression using the poolRegExp keyword, for example

poolRegExp: "podman-.*"

Note

The regular expression is interpreted with Python's re module.
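Putting the pieces together, a natsFlavor group selecting its pools explicitly could look like the sketch below. It follows the structure of the minimal MasterQueue example in the next section; pools is assumed here to sit at the same level as poolRegExp, and the pool names are purely illustrative:

flavors:
  - name: interlink
    natsFlavor:
      natsConnector: "nats://user:password@nats.nats:4222"
      natsSubject: interlink.resources
      virtualNode: interlink
      poolTimeout: 30
      pools:             # explicit list instead of poolRegExp (assumed nesting)
        - podman-firenze
        - slurm-cineca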

Minimal MasterQueue definition

A minimal, annotated example of a MasterQueue is shown below.

apiVersion: vk.io/v1
kind: MasterQueue
metadata:
  name: masterqueue  # This is the name of the MasterQueue 
spec:
  template:
    cohort: masterqueue  # This is the name of the cohort. It is suggested to use the same name as for the MasterQueue.

  flavors:   # Here we introduce the list of resource flavors organized in categories. RFs can be either `natsFlavor`s or `localFlavor`s.
    - name: interlink  # This is the name of the flavor category (not of the flavor itself!)
      natsFlavor:
        natsConnector: "nats://user:password@nats.nats:4222"
        natsSubject: interlink.resources
        virtualNode: interlink
        poolTimeout: 30
        poolRegExp: ".*"
      nominalQuota:
        pods: 100
      canLendTo:
        - queue: group1
          lendingLimit:
            pods: 30

This creates two ClusterQueue objects belonging to the same cohort, masterqueue. The first, named masterqueue (like the MasterQueue object), has quotas based on the resources published by the various providers; the other, named group1, has no quota of its own, but a borrowingLimit for pods set to 30.

Warning

The limits are specified per resource pool. In the example above, if two resource pools are available, the ClusterQueue group1 will be entitled to borrow 30 pods from the first one and 30 pods from the second.
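For reference, in plain Kueue terms the generated group1 ClusterQueue is roughly equivalent to the sketch below; the exact objects produced by kueue-nats may differ, and the flavor name podman-firenze is purely illustrative (one such entry would appear per resource pool):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: group1
spec:
  cohort: masterqueue
  resourceGroups:
    - coveredResources: ["pods"]
      flavors:
        - name: podman-firenze   # illustrative, one entry per resource pool
          resources:
            - name: pods
              nominalQuota: 0    # no quota of its own
              borrowingLimit: 30 # may borrow up to 30 pods from this pool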

Creating a LocalQueue

To connect a namespace to a ClusterQueue, the cluster admins must create LocalQueue objects. For example, to enable jobs in the default namespace to run through the group1 ClusterQueue, one may define the gr1 LocalQueue as follows:

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
   namespace: default
   name: gr1
spec:
   clusterQueue: group1
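Once created, the LocalQueue and its binding to the ClusterQueue can be verified with kubectl:

# Check that the LocalQueue exists and points to the expected ClusterQueue
kubectl -n default get localqueue gr1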

A test job

To test the setup, one may submit a test job in the default namespace through the queue gr1.

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-1
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gr1
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: ghcr.io/grycap/cowsay:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            echo "Hello world!"
            sleep 30
        resources:
          requests:
            cpu: 1
            memory: "200Mi"
      tolerations:
        - key: virtual-node.interlink/no-schedule
          operator: Exists
          effect: NoSchedule
      restartPolicy: Never

Note in particular:

* the label kueue.x-k8s.io/queue-name: gr1, selecting the local queue;
* the spec suspend: true, letting Kueue resume the job once it is admitted;
* the toleration for the node taint virtual-node.interlink/no-schedule, marking the job as suitable for offloading.
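Since the manifest uses generateName, it has to be submitted with kubectl create rather than kubectl apply; the file name below is just an example. Admission can then be followed through the Workload objects created by Kueue:

kubectl create -f sample-job.yaml
# Follow Kueue admission and job execution
kubectl -n default get workloads
kubectl -n default get jobs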

Local flavors

To enable local execution of payloads not suitable for offloading, one may include a local flavor in the MasterQueue. Local flavors are simply links to standard Kueue ResourceFlavor objects that must already be defined in the cluster and fall outside the scope of the kueue-nats controller.

For example, the following YAML manifest defines a ResourceFlavor and makes it available to the group1 ClusterQueue via the MasterQueue.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "local-cpu"  # <-- Note this name

---

apiVersion: vk.io/v1
kind: MasterQueue
metadata:
  name: masterqueue
spec:
  template:
    cohort: masterqueue

  flavors:
    - name: local-cpu   # <-- This name should match the ResourceFlavor definition above
      localFlavor: {}
      nominalQuota:
        pods: 100
      canLendTo:
        - queue: group1
          lendingLimit:
            pods: 30
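After applying the manifest, one can check that the kueue-nats controller picked up the local flavor by listing the Kueue objects; besides masterqueue and group1, the names of the generated ResourceFlavors depend on the published resource pools:

kubectl get resourceflavors
kubectl get clusterqueues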