Neuron Device Plugin Addon #777

Merged
merged 14 commits into from
Feb 14, 2024
Changes from 9 commits
52 changes: 52 additions & 0 deletions docs/addons/neuron-plugin-addon.md
@@ -0,0 +1,52 @@
# Neuron Device Plugin Addon

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. This addon installs the Neuron Device Plugin, which is required to run workloads on those instances in Amazon EKS clusters, including clusters built with EKS Blueprints. Note that you **must** use *inf1, inf2, trn1,* or *trn1n* instances.
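
The instance-family requirement above can be checked before a node group is defined. The following is a minimal sketch, not part of the addon: `isNeuronInstanceType` and `NEURON_FAMILIES` are hypothetical names, and the family list is taken only from the sentence above.

```typescript
// Hypothetical guard, not part of the addon: the family list mirrors the
// instance families named in this document.
const NEURON_FAMILIES: readonly string[] = ["inf1", "inf2", "trn1", "trn1n"];

// Returns true when the instance type (for example "inf1.2xlarge")
// belongs to one of the families listed above.
function isNeuronInstanceType(instanceType: string): boolean {
    const family = instanceType.split(".")[0].toLowerCase();
    return NEURON_FAMILIES.some((candidate) => candidate === family);
}

console.log(isNeuronInstanceType("inf1.2xlarge")); // true
console.log(isNeuronInstanceType("m5.4xlarge"));   // false
```

A check like this could run before the node group is passed to the cluster provider, so that an unsupported instance type fails at synth time rather than after deployment.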

## Usage

#### **`index.ts`**
```typescript
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

const addOn = new blueprints.addons.NeuronPluginAddOn();

const clusterProvider = new blueprints.GenericClusterProvider({
    version: KubernetesVersion.V1_27,
    managedNodeGroups: [
        inferentiaNodeGroup()
    ]
});

// Managed node group backed by an Inferentia (inf1) instance type.
function inferentiaNodeGroup(): blueprints.ManagedNodeGroup {
    return {
        id: "mng1",
        instanceTypes: [new ec2.InstanceType('inf1.2xlarge')],
        desiredSize: 1,
        maxSize: 2,
        nodeGroupSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    };
}

const blueprint = blueprints.EksBlueprint.builder()
    .clusterProvider(clusterProvider)
    .addOns(addOn)
    .build(app, 'my-stack-name');
```

Once deployed, you can see the plugin daemonset in the `kube-system` namespace.

```sh
$ kubectl get daemonset neuron-device-plugin-daemonset -n kube-system

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          24m
```

## Functionality

1. Deploys the plugin daemonset in the `kube-system` namespace by default.
2. Lets workloads in the blueprint use the Neuron SDK on Inferentia or Trainium instances.
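
Once the plugin is running, a workload typically asks for Neuron devices through an extended resource in its container limits. The following is a minimal sketch under assumptions: it uses the resource name `aws.amazon.com/neuron` that the Neuron device plugin is documented to advertise, while `neuronPodManifest`, the pod name, and the image are hypothetical placeholders.

```typescript
// Hypothetical helper: builds a plain Pod manifest object that requests
// Neuron devices. The resource name is an assumption based on the Neuron
// device plugin documentation; name and image are placeholders.
interface PodManifest {
    apiVersion: string;
    kind: string;
    metadata: { name: string };
    spec: {
        containers: Array<{
            name: string;
            image: string;
            resources: { limits: Record<string, number> };
        }>;
    };
}

function neuronPodManifest(name: string, image: string, devices: number): PodManifest {
    if (!Number.isInteger(devices) || devices < 1) {
        throw new Error("devices must be a positive integer");
    }
    return {
        apiVersion: "v1",
        kind: "Pod",
        metadata: { name },
        spec: {
            containers: [{
                name,
                image,
                resources: { limits: { "aws.amazon.com/neuron": devices } },
            }],
        },
    };
}

const examplePod = neuronPodManifest("neuron-smoke-test", "example.com/my-neuron-image:latest", 1);
console.log(JSON.stringify(examplePod, null, 2));
```

An object like this could then be applied with `kubectl` (after serializing it) or passed to the CDK cluster's `addManifest` method.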
23 changes: 22 additions & 1 deletion examples/blueprint-construct/index.ts
@@ -212,6 +212,7 @@ export default class BlueprintConstruct {
efsFileSystem: 'apache-airflow-efs-provider'
}),
new blueprints.ExternalsSecretsAddOn(),
new blueprints.NeuronPluginAddOn(),
];

// Instantiated for helm version check.
@@ -232,7 +233,8 @@ export default class BlueprintConstruct {
addGenericNodeGroup(),
addCustomNodeGroup(),
addWindowsNodeGroup(), // commented out to check the impact on e2e
- addGpuNodeGroup()
+ addGpuNodeGroup(),
+ addInferentiaNodeGroup(),
]
});

@@ -397,4 +399,23 @@ function addGpuNodeGroup(): blueprints.ManagedNodeGroup {
};
}

function addInferentiaNodeGroup(): blueprints.ManagedNodeGroup {
    return {
        id: "mng4-inferentia",
        instanceTypes: [new ec2.InstanceType('inf1.2xlarge')],
        desiredSize: 1,
        minSize: 1,
        nodeRole: blueprints.getNamedResource("node-role") as iam.Role,
        diskSize: 50,
        tags: {
            "Name": "Mng4",
            "Type": "Managed-InferentiaNode-Group",
            "LaunchTemplate": "Inferentia",
            "kubernetes.io/cluster/blueprint-construct-dev": "owned"
        }
    };
}



154 changes: 154 additions & 0 deletions examples/young-construct/index.ts
@@ -0,0 +1,154 @@
import * as cdk from 'aws-cdk-lib';
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { KubernetesVersion, NodegroupAmiType } from 'aws-cdk-lib/aws-eks';
import { AccountRootPrincipal, Role } from 'aws-cdk-lib/aws-iam';
import { Construct } from "constructs";
import * as blueprints from '../../lib';
import { logger, userLog } from '../../lib/utils';

export interface BlueprintConstructProps {
/**
* Id
*/
id: string
}

export default class YoungConstruct {
constructor(scope: Construct, props: cdk.StackProps) {

blueprints.HelmAddOn.validateHelmVersions = true;
blueprints.HelmAddOn.failOnVersionValidation = false;
logger.settings.minLevel = 3;
userLog.settings.minLevel = 2;

const vpc = new blueprints.VpcProvider(undefined, {
primaryCidr: "10.2.0.0/16",
secondaryCidr: "100.64.0.0/16",
secondarySubnetCidrs: ["100.64.0.0/24","100.64.1.0/24","100.64.2.0/24"]
});
// const airflowEfs = new blueprints.CreateEfsFileSystemProvider({
// name: "airflow-efs-file-system"
// });
const apacheAirflowS3Bucket = new blueprints.CreateS3BucketProvider({
id: 'apache-airflow-s3-bucket-id',
s3BucketProps: { removalPolicy: cdk.RemovalPolicy.DESTROY }
});

const teams: Array<blueprints.Team> = [];
const addOns: Array<blueprints.ClusterAddOn> = [
new blueprints.addons.AwsLoadBalancerControllerAddOn(),
new blueprints.addons.CoreDnsAddOn(),
new blueprints.addons.KubeProxyAddOn(),
new blueprints.addons.SSMAgentAddOn(),
// new blueprints.addons.KarpenterAddOn({
// requirements: [
// { key: 'node.kubernetes.io/instance-type', op: 'In', vals: ['m5.2xlarge'] },
// { key: 'topology.kubernetes.io/zone', op: 'NotIn', vals: ['us-west-2c']},
// { key: 'kubernetes.io/arch', op: 'In', vals: ['amd64','arm64']},
// { key: 'karpenter.sh/capacity-type', op: 'In', vals: ['spot']},
// ],
// subnetTags: {
// "Name": "blueprint-construct-dev/blueprint-construct-dev-vpc/PrivateSubnet1",
// },
// securityGroupTags: {
// "kubernetes.io/cluster/blueprint-construct-dev": "owned",
// },
// taints: [{
// key: "workload",
// value: "test",
// effect: "NoSchedule",
// }],
// consolidation: { enabled: true },
// ttlSecondsUntilExpired: 2592000,
// weight: 20,
// interruptionHandling: true,
// limits: {
// resources: {
// cpu: 20,
// memory: "64Gi",
// }
// },
// }),
new blueprints.addons.EfsCsiDriverAddOn({replicaCount: 1}),
new blueprints.addons.EbsCsiDriverAddOn(),
// new blueprints.addons.JupyterHubAddOn({
// efsConfig: {
// pvcName: "efs-persist",
// removalPolicy: cdk.RemovalPolicy.DESTROY,
// capacity: '10Gi',
// },
// serviceType: blueprints.JupyterHubServiceType.CLUSTERIP,
// notebookStack: 'jupyter/datascience-notebook',
// values: { prePuller: { hook: { enabled: false }}}
// }),
// new blueprints.AwsBatchAddOn(),
new blueprints.AwsForFluentBitAddOn(),
// new blueprints.AirflowAddOn({
// enableLogging: true,
// s3BucketName: airflowS3bucket.name
// // enableRds: true,
// // dbConfig: {
// // username: "airflow-user",
// // password: "PA$$w0rd123",
// // dbName: "airflow",
// // },
// // enableEfs: true,
// // efsFileSystemName: airflowEfs.options.name!,
// }),
new blueprints.ApacheAirflowAddOn({
enableLogging: true,
s3Bucket: 'airflow-logging-s3-bucket',
// enableEfs: true,
// efsFileSystem: 'apache-airflow-efs-provider',
})
];

const blueprintID = 'young-blueprint-test';

const clusterProvider = new blueprints.GenericClusterProvider({
version: KubernetesVersion.V1_25,
mastersRole: blueprints.getResource(context => {
return new Role(context.scope, 'AdminRole', { assumedBy: new AccountRootPrincipal() });
}),
managedNodeGroups: [
{
id: "mng1",
amiType: NodegroupAmiType.AL2_X86_64,
// amiReleaseVersion: "",
instanceTypes: [new ec2.InstanceType('m5.4xlarge')],
diskSize: 25,
desiredSize: 2,
maxSize: 3,
nodeGroupSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }
}
]
});

// const batchTeam: blueprints.BatchEksTeamProps = {
// name: 'batch-a',
// namespace: 'aws-batch',
// envName: 'batch-a-comp-env',
// computeResources: {
// envType: blueprints.BatchEnvType.EC2,
// allocationStrategy: blueprints.BatchAllocationStrategy.BEST,
// priority: 10,
// minvCpus: 0,
// maxvCpus: 128,
// instanceTypes: ["m5", "c4.4xlarge"]
// },
// jobQueueName: 'team-a-job-queue',
// };

blueprints.EksBlueprint.builder()
.resourceProvider(blueprints.GlobalResources.Vpc, vpc)
.resourceProvider('airflow-logging-s3-bucket', apacheAirflowS3Bucket)
// .resourceProvider('airflow-efs-file-system', airflowEfs)
.addOns(...addOns)
.clusterProvider(clusterProvider)
.teams(...teams,
// new blueprints.BatchEksTeam(batchTeam)
)
// .enableControlPlaneLogTypes(blueprints.ControlPlaneLogType.API)
.build(scope, blueprintID, props);
}
}
1 change: 1 addition & 0 deletions lib/addons/index.ts
@@ -53,6 +53,7 @@ export * from './emr-on-eks';
export * from './aws-batch-on-eks';
export * from './upbound-universal-crossplane';
export * from './apache-airflow';
export * from './neuron';

export class Constants {
public static readonly BLUEPRINTS_ADDON = "blueprints-addon";
39 changes: 39 additions & 0 deletions lib/addons/neuron/index.ts
@@ -0,0 +1,39 @@
import { Construct } from "constructs";

import { ClusterAddOn, ClusterInfo } from "../../spi";
import { KubectlProvider, ManifestDeployment } from "../helm-addon/kubectl-provider";
import { loadMultiResourceExternalYaml } from "../../utils/yaml-utils";

const PLUGIN_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml";
const RBAC_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml";

export class NeuronPluginAddOn implements ClusterAddOn {
Collaborator

No options to deploy? Not even namespace? It is fine if ns should be kube-system, just want to check if anything is reasonable to expose for configuration.

Collaborator Author

On this one, it is straightforward. There may be an optional scheduler, which I just saw; I will add that with options in a fast follow-up.

Collaborator

Do you mind creating an issue and assigning it to yourself or Riccardo if you want to do a fast follow-up for options?

Collaborator Author

Will do it in this PR actually, testing it right now.

    deploy(clusterInfo: ClusterInfo): Promise<Construct> {
        const kubectlProvider = new KubectlProvider(clusterInfo);

        // Read in YAML docs
        const rbac = loadMultiResourceExternalYaml(RBAC_URL);
        const rbacManifest: ManifestDeployment = {
            name: "neuron-rbac-manifest",
            namespace: "",
            manifest: rbac,
            values: {}
        };

        const plugin = loadMultiResourceExternalYaml(PLUGIN_URL);
        const pluginManifest: ManifestDeployment = {
            name: "neuron-plugin-manifest",
            namespace: "kube-system",
            manifest: plugin,
            values: {}
        };

        const rbacStatement = kubectlProvider.addManifest(rbacManifest);
        const pluginStatement = kubectlProvider.addManifest(pluginManifest);

        // Plugin dependency on the RBAC manifest
        pluginStatement.node.addDependency(rbacStatement);

        return Promise.resolve(pluginStatement);
    }
}
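
The review thread above asks whether anything, such as the namespace, should be configurable. The following is a minimal sketch of one way optional settings with defaults could be resolved. It is not the implementation from this PR: `NeuronPluginOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical names, and the default values are copied from the constants and the hardcoded namespace shown in this diff.

```typescript
// Hypothetical options shape; nothing like this exists in the snapshot above.
interface NeuronPluginOptions {
    namespace?: string;
    pluginUrl?: string;
    rbacUrl?: string;
}

// Defaults copied from the values hardcoded in lib/addons/neuron/index.ts.
const DEFAULT_OPTIONS: Required<NeuronPluginOptions> = {
    namespace: "kube-system",
    pluginUrl: "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml",
    rbacUrl: "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml",
};

// Fills in every option the caller left undefined with its default.
function resolveOptions(options: NeuronPluginOptions = {}): Required<NeuronPluginOptions> {
    return {
        namespace: options.namespace ?? DEFAULT_OPTIONS.namespace,
        pluginUrl: options.pluginUrl ?? DEFAULT_OPTIONS.pluginUrl,
        rbacUrl: options.rbacUrl ?? DEFAULT_OPTIONS.rbacUrl,
    };
}

console.log(resolveOptions().namespace);                        // kube-system
console.log(resolveOptions({ namespace: "neuron" }).namespace); // neuron
```

A constructor could call `resolveOptions(props)` once and use the resolved `namespace` where the plugin manifest currently hardcodes `kube-system`.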
9 changes: 9 additions & 0 deletions lib/utils/yaml-utils.ts
@@ -35,6 +35,15 @@ export function readYamlDocument(path: string): string {
}
}

/**
 * Reads a local multi-document YAML file and parses each segment.
 * Note: the text is split on every occurrence of `---`.
 */
export function loadMultiResourceYaml(path: string): any {
    const doc = readYamlDocument(path);
    return doc.split("---").map((e: any) => loadYaml(e));
}

/**
 * Loads YAML from an external URL via loadExternalYaml and returns the result unchanged.
 */
export function loadMultiResourceExternalYaml(url: string): any {
    const doc = loadExternalYaml(url);
    return doc;
}

export function loadYaml(document: string): any {
return yaml.load(document);
6 changes: 6 additions & 0 deletions test/utils/multi-yaml-test.yaml
@@ -0,0 +1,6 @@
---
kind: ClusterRole
---
kind: Deployment
---
kind: Pod
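
The fixture above starts with a `---` separator, so a plain `split("---")`, as used in `loadMultiResourceYaml`, yields an empty leading segment, and it would also split inside any value that happens to contain `---`. The following is a minimal sketch of a stricter splitter. `splitYamlDocuments` is a hypothetical helper, not part of `lib/utils/yaml-utils.ts`, and it only separates the raw text without parsing it.

```typescript
// Hypothetical helper, not part of lib/utils/yaml-utils.ts.
// Splits multi-document YAML text into the raw text of each document.
function splitYamlDocuments(text: string): string[] {
    return text
        .split(/^---[ \t]*$/m) // only lines consisting solely of `---` are separators
        .map((segment) => segment.trim())
        .filter((segment) => segment.length > 0); // drop empty segments, e.g. before a leading `---`
}

// Same content as the multi-yaml-test.yaml fixture.
const fixture = "---\nkind: ClusterRole\n---\nkind: Deployment\n---\nkind: Pod\n";

console.log(fixture.split("---").length);        // 4: includes an empty leading segment
console.log(splitYamlDocuments(fixture).length); // 3
```

Each returned segment could then be handed to a YAML parser individually.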
1 change: 1 addition & 0 deletions test/utils/yaml-test.yaml
@@ -0,0 +1 @@
apiVersion: apps/v1