Run inference on AWS Inferentia

Now we can use the compiled model to run an inference workload on an AWS Inferentia node.

Create a pod for inference

Check the container image we'll use to run the inference:

~$echo $AIML_DL_INF_IMAGE

This is a different image from the one we used for training, and it has been optimized for inference.

Now we can deploy a Pod for inference. This is the manifest file for running the inference Pod:

~/environment/eks-workshop/modules/aiml/inferentia/inference/inference.yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
  namespace: aiml
  labels:
    role: inference
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: inf2.xlarge
  containers:
    - command:
        - sh
        - -c
        - sleep infinity
      image: ${AIML_DL_INF_IMAGE}
      name: inference
      resources:
        limits:
          aws.amazon.com/neuron: 1
  serviceAccountName: inference
For inference we've set the nodeSelector section to request an inf2 instance type.
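
If you want to see the label this nodeSelector matches against, the following optional check (a standard kubectl invocation) lists the cluster's nodes with their instance types:

~$kubectl get nodes -L node.kubernetes.io/instance-type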

In the resources limits section we again specify that we need a Neuron device to run this Pod.

~$kubectl kustomize ~/environment/eks-workshop/modules/aiml/inferentia/inference \
| envsubst | kubectl apply -f-
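
Until a suitable node is available, the Pod remains unschedulable. You can optionally confirm this by checking its status, which should initially show Pending:

~$kubectl -n aiml get pod inference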

Karpenter again detects the pending Pod, which this time needs an inf2 instance with Neuron cores, and launches an inf2 instance backed by the Inferentia chip. You can again monitor the instance provisioning with the following command:

~$kubectl logs -l app.kubernetes.io/instance=karpenter -n kube-system -f | jq
...
{
  "level": "INFO",
  "time": "2024-09-19T18:53:34.266Z",
  "logger": "controller",
  "message": "launched nodeclaim",
  "commit": "6e9d95f",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "aiml-v64vm"
  },
  "namespace": "",
  "name": "aiml-v64vm",
  "reconcileID": "7b5488c5-957a-4051-a657-44fb456ad99b",
  "provider-id": "aws:///us-west-2b/i-0078339b1c925584d",
  "instance-type": "inf2.xlarge",
  "zone": "us-west-2b",
  "capacity-type": "on-demand",
  "allocatable": {
    "aws.amazon.com/neuron": "1",
    "cpu": "3920m",
    "ephemeral-storage": "89Gi",
    "memory": "14162Mi",
    "pods": "58",
    "vpc.amazonaws.com/pod-eni": "18"
  }
}
...
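
As an alternative to tailing the Karpenter logs, you can watch the NodeClaim objects Karpenter creates (the karpenter.sh NodeClaim API shown in the log output above):

~$kubectl get nodeclaims -w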

The inference Pod should be scheduled on the node provisioned by Karpenter. Check that the Pod reaches its Ready state:

note

It can take up to 12 minutes to provision the node, add it to the EKS cluster, and start the pod.

~$kubectl -n aiml wait --for=condition=Ready --timeout=12m pod/inference
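
Optionally, you can verify which node the Pod landed on using the wide output format:

~$kubectl -n aiml get pod inference -o wide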

We can use the following command to get more details on the node that was provisioned for our Pod:

~$kubectl get node -l karpenter.sh/nodepool=aiml -o jsonpath='{.items[0].status.capacity}' | jq .

This output shows the capacity this node has:

{
  "aws.amazon.com/neuron": "1",
  "aws.amazon.com/neuroncore": "2",
  "aws.amazon.com/neurondevice": "1",
  "cpu": "4",
  "ephemeral-storage": "104845292Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "16009632Ki",
  "pods": "58",
  "vpc.amazonaws.com/pod-eni": "18"
}

We can see that this node has an aws.amazon.com/neuron capacity of 1. Karpenter provisioned this node for us because that's how many Neuron devices the Pod requested.
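
Note that the output above shows the node's total capacity. To see what remains available to Pods after system reservations, you can query the allocatable resources with the same pattern as the capacity command above:

~$kubectl get node -l karpenter.sh/nodepool=aiml -o jsonpath='{.items[0].status.allocatable}' | jq .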

Run an inference

This is the code that we will be using to run inference using a Neuron core on Inferentia:

~/environment/eks-workshop/modules/aiml/inferentia/inference/inference.py
import os
import time
import torch
import torch_neuronx
import json
import numpy as np

from urllib import request

from torchvision import models, transforms, datasets

## Create an image directory containing a small kitten
os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve(
    "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
    "./torch_neuron_test/images/kitten_small.jpg",
)


## Fetch labels to output the top classifications
request.urlretrieve(
    "https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json",
    "imagenet_class_index.json",
)
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
    class_idx = json.load(read_file)
    idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

## Import a sample image and normalize it into a tensor
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

eval_dataset = datasets.ImageFolder(
    os.path.dirname("./torch_neuron_test/"),
    transforms.Compose(
        [
            transforms.Resize([224, 224]),
            transforms.ToTensor(),
            normalize,
        ]
    ),
)

image, _ = eval_dataset[0]
image = torch.tensor(image.numpy()[np.newaxis, ...])

## Load model
model_neuron = torch.jit.load("resnet50_neuron.pt")

## Predict
results = model_neuron(image)

# Get the top 5 results
top5_idx = results[0].sort()[1][-5:]

# Lookup and print the top 5 labels
top5_labels = [idx2label[idx] for idx in top5_idx]

print("Top 5 labels:\n {}".format(top5_labels))

This Python code does the following tasks:

  1. It downloads and stores an image of a small kitten.
  2. It fetches the labels for classifying the image.
  3. It then imports this image and normalizes it into a tensor.
  4. It loads our previously created model (see the sketch after this list).
  5. It runs the prediction on our small kitten image.
  6. It gets the top 5 results from the prediction and prints these to the command-line.
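
For context, this is a minimal sketch of how the resnet50_neuron.pt file loaded in step 4 could have been produced in the earlier compilation step, assuming torch_neuronx and torchvision are available; the exact compile script from that lab may differ:

import torch
import torch_neuronx
from torchvision import models

## Load the pre-trained ResNet-50 model and switch to evaluation mode
model = models.resnet50(pretrained=True)
model.eval()

## Compile (trace) the model for Neuron using an example input with the
## same shape the inference code feeds it: one 224x224 RGB image
example = torch.rand([1, 3, 224, 224])
model_neuron = torch_neuronx.trace(model, example)

## Save the compiled model; inference.py loads this file with torch.jit.load
torch.jit.save(model_neuron, "resnet50_neuron.pt")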

The following commands copy this code to the Pod, download our previously uploaded model from Amazon S3, and run the inference:

~$kubectl -n aiml cp ~/environment/eks-workshop/modules/aiml/inferentia/inference/inference.py inference:/
~$kubectl -n aiml exec inference -- pip install --upgrade boto3 botocore
~$kubectl -n aiml exec inference -- aws s3 cp s3://$AIML_NEURON_BUCKET_NAME/resnet50_neuron.pt ./
~$kubectl -n aiml exec inference -- python /inference.py
 
Top 5 labels:
 ['tiger', 'lynx', 'tiger_cat', 'Egyptian_cat', 'tabby']

As output we get the top 5 labels back. Since we ran the inference on an image of a small kitten using the pre-trained ResNet-50 model, these results are expected. As a possible next step we could build our own data set of images and train a model for our specific use case, which could improve the prediction results.
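
If you're curious how confident the model is rather than just the ranking, a small addition to inference.py converts the raw logits to probabilities. This sketch reuses the results tensor and idx2label list defined in the script above:

import torch

## Convert the raw logits returned by the compiled model into probabilities
probs = torch.nn.functional.softmax(results[0], dim=0)

## Print the five most likely labels together with their probabilities
top5 = torch.topk(probs, 5)
for p, idx in zip(top5.values, top5.indices):
    print("{}: {:.3f}".format(idx2label[idx], p.item()))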

This concludes this lab on using AWS Inferentia with Amazon EKS.