Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for usetting pod identity with spark-run #4016

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Qmando
Copy link
Member

@Qmando Qmando commented Mar 4, 2025

As described in an internal tech spec document. Support for using pod identity for manual spark-runs.

Pairs with Yelp/service_configuration_lib#155

  1. Check yelpsoa-configs for IAM roles assigned to a particular service when running spark-run. Use those to create a service account, and pass that to the spark conf builder
  2. Allow you to override this, but it must be a valid role for at least one instance of the service
  3. When this is used, don't send IAM credentials to the spark conf builder. They are still passed to configure_and_run_docker_container, where they will be used for the driver
  4. If no aws role/profile/user is specified, use a default one (mrjob as placeholder. this will change in the near future to the spark-driver-k8s user) for the driver. This user can list files in S3 and write to the spark logs buckets.

Test run using spark integration test, as if iam_role set: https://fluffy.yelpcorp.com/i/mmcV6DNvz0zTbC6VJScBcFjKw3Dd466L.html
Test run with no iam_role set, --aws-credentials-yaml provided instead: https://fluffy.yelpcorp.com/i/WDm3kj0R3sg9rVqMkzMmzlJVpSSRhBNc.html
Test with unauthorized forced pod identity: https://fluffy.yelpcorp.com/i/XQR1VpgjCQXLfbd6CfJbdGkwGlKX4ZS5.html

Deployment plan:
If this gets merged, I'll update package version in dev and verify all the tron-based spark jobs are still succeeding as normal.

@Qmando Qmando force-pushed the spark_run_pod_identity_executor branch from 930b83e to 4bbaaa0 Compare March 24, 2025 21:15
@Qmando Qmando marked this pull request as ready for review March 24, 2025 22:28
nemacysts
nemacysts previously approved these changes Mar 24, 2025
@nemacysts
Copy link
Member

whoops, meant to add this note to my review:

@Qmando: probably wanna wait for someone on ml-compute to ship this as well since they own this, but lgtm :)

@Qmando Qmando requested review from 88manpreet and removed request for 88manpreet March 24, 2025 23:39
nurdann
nurdann previously approved these changes Mar 25, 2025
@chi-yelp
Copy link
Contributor

  1. Check yelpsoa-configs for IAM roles assigned to a particular service when running spark-run. Use those to create a service account, and pass that to the spark conf builder

Just to confirm - we're only getting the service accounts names here for executor pod identity, not creating them right?

chi-yelp
chi-yelp previously approved these changes Mar 25, 2025
@nurdann
Copy link
Member

nurdann commented Mar 25, 2025

  1. Check yelpsoa-configs for IAM roles assigned to a particular service when running spark-run. Use those to create a service account, and pass that to the spark conf builder

Just to confirm - we're only getting the service accounts names here for executor pod identity, not creating them right?

Yes, even currently you can only use SA that already exists on the namespace. This further filters it so that only IAM roles specified in yelpsoa-configs can be used.

@Qmando Qmando dismissed stale reviews from chi-yelp, nurdann, and nemacysts via b37d8b4 March 25, 2025 21:22
@Qmando Qmando requested a review from a team as a code owner March 25, 2025 22:52
nemacysts
nemacysts previously approved these changes Mar 25, 2025
chi-yelp
chi-yelp previously approved these changes Mar 26, 2025
@Qmando Qmando dismissed stale reviews from chi-yelp and nemacysts via 5d3cfb9 March 27, 2025 22:17
@Qmando
Copy link
Member Author

Qmando commented Mar 27, 2025

@nemacysts @chi-yelp
I've put all the change in behavior behind a new arg, --executor-pod-identity in the latest commit. I was hoping to avoid having to do this, but I realized there will be cases where this change could break existing jobs. See https://yelp.slack.com/archives/C08920ZF0Q2/p1743024040622489 for more details.

This change makes this PR much safer to deploy, but it also means we'll need to be more thorough with manual testing. I hope to deprecate it in the future when everything is setup correctly.

@Qmando Qmando requested review from nemacysts and chi-yelp March 27, 2025 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants