Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

created nvidia gpu operator addon #824

Merged
merged 21 commits into from
Aug 25, 2023
Merged

Conversation

Howlla
Copy link
Contributor

@Howlla Howlla commented Aug 17, 2023

Issue #, if available:

Description of changes:
Creates new addon for gpu-operator which extends helm addon. Still working on md file for documentation.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla Thankyou for taking time to get this addon submitted. Couple of things are missing, please reachout for any questions:

  1. Addon Documentation is missing. Check docs/addons/grafana-operator.md as an example.
  2. Addon is missing in indexing list of documentation.
  3. Addon should be added to e2e in examples/blueprint-construct/index.ts
  4. mkdocs.yaml should be updated with new addon.
  5. Also see if you have opportunity to add unit tests under tests

overall great work. Are you planning to do gpu-builder in separate PR?

lib/addons/gpu-operator/index.ts Outdated Show resolved Hide resolved
lib/addons/gpu-operator/index.ts Outdated Show resolved Hide resolved
@Howlla
Copy link
Contributor Author

Howlla commented Aug 18, 2023

Created builder and addon with documentation. Also added it to the e2e test

@elamaran11 i was not sure whether to add the gpu operator addon in the builder itself or will we do it through the addon. It is a small change in line 124 of gpu-builder.ts

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla Looks good to me, great work on the Builder class on Addon. Following needs to be done :

  1. Builder class doc is missing
  2. Also builder class should be added to mkdocs.
  3. Plus the builder class is not aligned to functionality we discussed yesterday. So please make sure you add GPU Addon, have a method like enableGPU and parametrize GPU like AmpAddon.

]
})
)
.addOns(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should have GPU Addon as part of the builder. You should also be adding other required addons for the cluster which you used in your sample runs like coredns, kube-proxy etc etc to minimally run the solution. You can follow observability builder as an example to have a method like enableGPU and have all addons. Also as i mentioned before, you should think about grabbing all values as parameters from pattern you are building to builder class to pass to GPU Addon like AmpAddon.

mkdocs.yml Show resolved Hide resolved
Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla We are on track. Just couple of comments.

const region = process.env.CDK_DEFAULT_REGION!;
const stackID = `${id}-eks-blueprint`;

const nodeRole = new blueprints.CreateRoleProvider("blueprint-node-role", new iam.ServicePrincipal("ec2.amazonaws.com"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not required. We needed that for windows for different reason

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, should i remove the nodeRole arguments from gpu-builder.ts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodeRole itself is not required here because its taken care by blueprints. This was specific to make windows work. I think you should follow the Observability Builder convention.

lib/builders/gpu-builder.ts Outdated Show resolved Hide resolved
Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shapirov103 This PR looks near to complete with couple of comments open. Can you take a look in the PR, provide feedback so we can merge this with Release?

Copy link
Collaborator

@shapirov103 shapirov103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla great work, I added a few comments, there is a bit of work left, happy to discuss.
Please also make sure you run standard commands for PR submission such as make lint. This process is described here: https://github.com/aws-quickstart/cdk-eks-blueprints/blob/main/docs/internal/readme-internal.md#submitting-pull-requests

.github/workflows/linkcheck.json Outdated Show resolved Hide resolved
lib/builders/gpu-builder.ts Show resolved Hide resolved
lib/builders/gpu-builder.ts Outdated Show resolved Hide resolved
* This method helps you prepare a blueprint for setting up observability
* returning an array of blueprint addons for AWS managed open source services
*/
public enableGpu(values?: blueprints.Values): GpuBuilder {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these values? it is going confusing to the customers which values are allowed to be passed. the way it stands, it is just key value pairs and no indication of the schema. Did you mean to allow passing some particular type here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added valueSchema similar to container insights values

* This method helps you prepare a blueprint for setting up windows nodes with
* usage tracking addon
*/
public static builder(): GpuBuilder {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any constructor args to allow customers to pass here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing gpuOptions now

docs/addons/gpu-operator.md Show resolved Hide resolved
@Howlla
Copy link
Contributor Author

Howlla commented Aug 22, 2023

Fixed the issues @shapirov103

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla The gpu-builder.ts should be reworked based on my current PR for observability, Also you should add the builder to index.ts in builder folder in lib. Please let us know once done, we can review.

lib/builders/gpu-builder.ts Outdated Show resolved Hide resolved
lib/builders/gpu-builder.ts Outdated Show resolved Hide resolved
@Howlla
Copy link
Contributor Author

Howlla commented Aug 22, 2023

made changes @elamaran11

elamaran11
elamaran11 previously approved these changes Aug 23, 2023
Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla Did you merge from the branch in my PR. I think i have most of these changes for builders plus few other changes in my PR needs to be merged, before we can move in. @shapirov103 Please check

@Howlla
Copy link
Contributor Author

Howlla commented Aug 23, 2023

I havent merged from your yours but i made sure no conflicts occur. Let me know if you want me to create a branch with both ghaction and gpu builder code

@elamaran11
Copy link
Collaborator

I havent merged from your yours but i made sure no conflicts occur. Let me know if you want me to create a branch with both ghaction and gpu builder code

You can do couple of things. One merge from my branch or merge from main after my PR is merged.

@Howlla
Copy link
Contributor Author

Howlla commented Aug 23, 2023

Merged with main branch and resolved conflicts

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @shapirov103 Please check and merge this.,

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more minor feedback.

lib/builders/gpu-builder.ts Outdated Show resolved Hide resolved
elamaran11
elamaran11 previously approved these changes Aug 25, 2023
Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @shapirov103 Please review this, we can move to merge.

shapirov103
shapirov103 previously approved these changes Aug 25, 2023
Copy link
Collaborator

@shapirov103 shapirov103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Howlla Great work!
I added a question that should not affect the PR status, i will merge after e2e.

WITH_REBOOT: true
},
toolkit: {
version: 'v1.13.1-centos7'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a mandatory parameter? I am curious how customers (and us) will update this moving forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This toolkit version was verified as working with AL2 by Re Alvarez

@shapirov103
Copy link
Collaborator

/do-e2e-tests

@Howlla Howlla dismissed stale reviews from shapirov103 and elamaran11 via 8ace408 August 25, 2023 19:52
@shapirov103
Copy link
Collaborator

/do-e2e-tests

Copy link

@aws-ia-ci aws-ia-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end to end tests passed

@shapirov103 shapirov103 merged commit 65f76b4 into aws-quickstart:main Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants