Provisioning Node Pools for LLM Workloads
In this lab, we'll use Karpenter to provision the AWS Trainium (Trn1) nodes needed to serve the Mistral-7B chatbot workload. As a node autoscaler, Karpenter provisions the compute capacity required to run machine learning workloads as their pods are scheduled.
To learn more about Karpenter, check out the Karpenter module in this workshop.
Karpenter has already been installed in our EKS cluster and runs as a deployment.
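You can verify this with a command along the following lines (this assumes Karpenter was installed into the karpenter namespace):

kubectl get deployment -n karpenter

This should return output similar to: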
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
...
karpenter   2/2     2            2           11m
Since the Ray cluster's head and worker pods have different resource requirements that map to different EC2 instance families, we'll create two separate node pools to handle the workload demands.
Here's the first Karpenter NodePool, which will provision x86 CPU instances for the Ray head pod:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: x86-cpu-karpenter
spec:
  template:
    metadata:
      labels:
        type: karpenter
        instanceType: mixed-x86
        provisionerType: Karpenter
        workload: rayhead
        vpc.amazonaws.com/has-trunk-attached: "true" # Required for Pod ENI
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["c5", "m5", "r5"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      expireAfter: 720h
      terminationGracePeriod: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: x86-cpu-karpenter
  limits:
    cpu: "256"
  disruption:
    consolidateAfter: 300s
    consolidationPolicy: WhenEmptyOrUnderutilized
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: x86-cpu-karpenter
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 200Gi
        volumeType: gp3
  detailedMonitoring: true
  role: ${KARPENTER_NODE_ROLE}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
    - tags:
        kubernetes.io/cluster/eks-workshop: owned
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
        kubernetes.io/role/internal-elb: "1"
  tags:
    app.kubernetes.io/created-by: eks-workshop
We're asking the NodePool to start all new nodes with the Kubernetes label type: karpenter, which allows us to specifically target Karpenter nodes with pods for demonstration purposes. Because Karpenter autoscales nodes for more than one pool, additional labels such as instanceType: mixed-x86 indicate that nodes launched here belong to the x86-cpu-karpenter pool; a pod can target them as shown in the sketch below.
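For example, a pod meant to land on nodes from this pool could target those labels through a nodeSelector. The snippet below is only an illustrative sketch; the pod name and image are hypothetical and not part of the lab manifests:

apiVersion: v1
kind: Pod
metadata:
  name: node-label-demo # hypothetical pod, for illustration only
spec:
  nodeSelector:
    type: karpenter          # matches the label applied by this NodePool
    instanceType: mixed-x86  # targets the x86-cpu-karpenter pool specifically
  containers:
    - name: demo
      image: public.ecr.aws/docker/library/busybox:stable
      command: ["sleep", "3600"]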
The NodePool CRD supports defining node properties like instance type and zone. In this example, we're setting karpenter.sh/capacity-type to initially limit Karpenter to provisioning On-Demand instances, and karpenter.k8s.aws/instance-family to limit nodes to a subset of instance families. You can learn which other properties are available here. Compared to the previous lab, there are more specifications defining the unique constraints of the head pod, such as restricting the instance family to c5, m5, and r5 nodes.
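The requirements list can be extended with other well-known labels. For instance, if you also wanted to pin nodes to specific Availability Zones, you could add a requirement along these lines (the zone names below are placeholders for your region):

        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a", "us-west-2b"]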
A NodePool can define a limit on the amount of CPU and memory it manages. Once this limit is reached, Karpenter will not provision additional capacity associated with that particular NodePool, providing a cap on the total compute.
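Once Karpenter has launched nodes for this pool, you can compare current usage against that limit by inspecting the NodePool status. A command along these lines should show the aggregate resources, though the exact status fields can vary between Karpenter versions:

kubectl get nodepool x86-cpu-karpenter -o jsonpath='{.status.resources}'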
The second NodePool will provision trn1.2xlarge instances for the Ray workers:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: trainium-trn1
spec:
  template:
    metadata:
      labels:
        instanceType: trn1.2xlarge
        provisionerType: Karpenter
        neuron.amazonaws.com/neuron-device: "true"
        vpc.amazonaws.com/has-trunk-attached: "true" # Required for Pod ENI
    spec:
      taints:
        - key: aws.amazon.com/neuron
          value: "true"
          effect: "NoSchedule"
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["trn1.2xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      expireAfter: 720h
      terminationGracePeriod: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: trainium-trn1
  limits:
    aws.amazon.com/neuron: 2
    cpu: 16
    memory: 64Gi
  disruption:
    consolidateAfter: 300s
    consolidationPolicy: WhenEmptyOrUnderutilized
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: trainium-trn1
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest
  instanceStorePolicy: RAID0
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 500Gi
        volumeType: gp3
  role: ${KARPENTER_NODE_ROLE}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
    - tags:
        kubernetes.io/cluster/eks-workshop: owned
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
        kubernetes.io/role/internal-elb: "1"
  tags:
    app.kubernetes.io/created-by: eks-workshop
    karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
    aws-neuron: "true"
We're asking this NodePool to start all new nodes with the Kubernetes label provisionerType: Karpenter, which allows us to specifically target Karpenter nodes with pods for demonstration purposes. As before, additional labels such as instanceType: trn1.2xlarge indicate that nodes launched here belong to the trainium-trn1 pool.
The NodePool CRD supports defining node properties like instance type and zone. In this example, we're setting karpenter.sh/capacity-type to initially limit Karpenter to provisioning On-Demand instances, and node.kubernetes.io/instance-type to limit nodes to a specific instance type. You can learn which other properties are available here. In this case, the requirements match the needs of the Ray workers, which will run on trn1.2xlarge instances.
A taint allows a node to repel a set of pods. Taints work together with tolerations, which are set on pods, so that pods are only scheduled onto appropriate nodes: only pods that tolerate the aws.amazon.com/neuron taint will land on this pool's nodes. You can learn more about taints and tolerations in this resource. The sketch after this paragraph shows the scheduling fields a Ray worker pod needs to match this pool.
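Putting the labels, taint, and Neuron device resource together, a Ray worker pod spec needs roughly the following scheduling fields. This is a hedged sketch of only the relevant parts, not the actual RayCluster manifest used later in this lab; the container name and image are placeholders:

  nodeSelector:
    instanceType: trn1.2xlarge
  tolerations:
    - key: "aws.amazon.com/neuron"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: ray-worker # placeholder name
      image: <your-ray-neuron-image> # placeholder image
      resources:
        limits:
          aws.amazon.com/neuron: "1" # one Neuron device, matching trn1.2xlarge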
As with the first pool, this NodePool defines limits, here on Neuron devices, CPU, and memory, so that Karpenter stops provisioning capacity for it once those caps are reached.
Together, these two node pools allow Karpenter to provision the right nodes for the Ray cluster's head and worker pods and keep up with the workload's demands.
Apply the NodePool and EC2NodeClass manifests for both pools.
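If you have saved the two manifests above to a local file, applying them looks like this (the filename is only an example):

kubectl apply -f llm-node-pools.yaml

You should see output similar to: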
ec2nodeclass.karpenter.k8s.aws/trainium-trn1 created
ec2nodeclass.karpenter.k8s.aws/x86-cpu-karpenter created
nodepool.karpenter.sh/trainium-trn1 created
nodepool.karpenter.sh/x86-cpu-karpenter created
Once the manifests are applied, check for the node pools.
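Listing the NodePool resources with kubectl should show both pools; the exact columns printed can vary by Karpenter version:

kubectl get nodepools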
NAME                NODECLASS
trainium-trn1       trainium-trn1
x86-cpu-karpenter   x86-cpu-karpenter
As seen from the output above, both node pools have been created, allowing Karpenter to provision new nodes into them as needed.