This document describes how to set up a development environment for developing, building, and testing the SageMaker Spark Container image.
You’ll need to have python, pytest, docker, and docker-compose installed on your machine and on your $PATH.
This repository uses GNU make
to run build targets specified in Makefile
. Consult the Makefile
for the full list of build targets.
- If you do not already have one, create a GitHub account by following the prompts at Join Github.
- Create a fork of this repository on GitHub. You should end up with a fork at
https://github.com/<username>/sagemaker-spark-container
.- Follow the instructions at Fork a Repo to fork a GitHub repository.
- Clone your fork of the repository:
git clone https://github.com/<username>/sagemaker-spark-container
where<username>
is your github username.
- To set up your python environment, we recommend creating and activating a virtual environment
using
venv
, which is part of the Python standard library for python 3+:
python -m venv .venv
source .venv/bin/activate
You may want to activate the Python environment in your .bashrc
or .zshrc
.
- Then install pytest into the virtual environment:
python -m pip install pytest pytest-parallel
- Ensure docker is installed (see: Get Docker | Docker Documentation)
docker
will be used to build and test the Spark container locally
- Ensure you have access to an AWS account i.e. setup your environment such that awscli can access your account via either an IAM user or an IAM role. We recommend an IAM role for use with AWS. For the purposes of testing in your personal account, the following managed permissions should suffice:
-- AmazonSageMakerFullAccess
-- AmazonS3FullAccess
-- AmazonEC2ContainerRegistryFullAccess
-
Create an ECR repository with the name "sagemaker-spark" in the us-west-2 region
-
Setup required environment variables for the container build:
# The AWS account id where the spark image is hosted in ECR
export AWS_ACCOUNT_ID=<YOUR_AWS_ACCOUNT_ID>
# AWS region
export REGION=us-west-2
# The AWS account id which will be used as in integration test acount, can be the same as AWS_ACCOUNT_ID above
export INTEG_TEST_ACCOUNT_ID=<YOUR_INTEGT_TEST_AWS_ACCOUNT_ID>
# Name of the ECR repository where spark image will be hosted
export SPARK_REPOSITORY=sagemaker-spark-processing
# TAG of the spark image
export VERSION=latest
# AWS IAM Role to be used for running the integration tests
export SAGEMAKER_ROLE=<YOUR_INTEG_TEST_SAGEMAKER_ROLE>
# Domain for AWS, for CN regions it is `amazonaws.com.cn` for other regions it is `amazonaws.com`
export AWS_DOMAIN=amazonaws.com
# Partition for AWS, for CN regions it is `aws-cn` for other regions it is `aws`
export AWS_PARTITION=aws
Compiling Scala test JARs requires SBT installed on your system.
Mac users can easily install using Homebrew:
brew install sbt
For more info see https://www.scala-sbt.org/1.x/docs/Setup.html
To compile the Scala test JAR:
make build-test-scala
Compiling Java test JARs requires Maven installed on your system.
Mac users can easily install using Homebrew:
brew install maven
For more info see https://maven.apache.org/install.html
To compile the Java test JAR:
make build-test-java
- To build the container image, run the following command:
make build
Upon successful build, you will see two tags applied to the image. For example:
Successfully tagged sagemaker-spark:2.4-cpu-py37-v0.1
Successfully tagged sagemaker-spark:latest
- To verify that the image is available in your local docker repository, run
docker images
. You should see an image with two tags. For example:
✗ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sagemaker-spark 2.4-cpu-py37-v0.1 a748a6e042d2 5 minutes ago 3.06GB
sagemaker-spark latest a748a6e042d2 5 minutes ago 3.06GB
To run local tests (unit tests and local container tests using docker compose), run the following command:
make test-local
To run tests against Amazon SageMaker using your newly built container image, first publish the image to your ECR repository.
- Bootstrap docker credentials for your repository
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com
- Tag the latest Spark image
docker tag ${SPARK_REPOSITORY}:latest $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/$SPARK_REPOSITORY:$VERSION
- Push the latest Spark image to your ECR repository
docker push $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/$SPARK_REPOSITORY:$VERSION
- Run the SageMaker tests
make test-sagemaker
- please run following command before you raise CR:
make test-unit
make install-container-library
- You need to create PR request in order to merge the code. How to create PR request lists here:https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
- You need to get Github access of AWS organization. please following here:https://w.amazon.com/?Open_Source/GitHub
- Get access to permission specific to a team, example is here:https://github.com/orgs/aws/teams/sagemakerwrite/members
- Ask a person to review the code and merge it in.This repo needs at least one code reviewer.
- The code needs to be signed before pushing. More detail about signing is here:https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits.
$ git commit -S -m "your commit message"
6.Remember in your local, you need to set up: git config --global user.signingkey [key id] and also upload public key into your github account. The email you specify when you created public key must match github email in github settings.
ERROR: smspark-0.1-py2-none-any.whl is not a supported wheel on this platform.
- This is because you are switching python version, make sure you are using python 3.* so that sage maker will generate right whl file. The right name is smspark-0.1-py3-none-any.whl. If you find the wrong file, then you need to delete it and then compile will succeed.
Error: TypeError: Descriptors cannot not be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
- Sometimes the protobuf package might be installed without your involvement. For this, you have two solutions to apply. Try one of the below solutions and it should work. pip install protobuf==3.20.* Or export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, see more detail in this link
- Running test code build is time consuming so before start a new job you can do the following things locally
- make test-unit
- make install-container-library
Fetching EC2 instance type info to ./spark/processing/3.1/py3/aws-config/ec2-instance-type-info.json ... ./scripts/fetch-ec2-instance-type-info.sh: line 32: jq: command not found
- you need to install corresponding packages.
| REPORT |
| checked 94 packages, using free DB (updated once a month) |
+============================+===========+==========================+==========+
| package | installed | affected | ID |
+============================+===========+==========================+==========+
| waitress | 2.1.1 | >=2.1.0,<2.1.2 | 49257 |
+==============================================================================+
make: *** [install-container-library] Error 255
- you need to update smsparkbuild/py39/Pipfile corresponding package version.
- Code build may fail because of the format, for example
2 files would be reformatted, 13 files would be left unchanged.
you can fix it by running
black src/smspark/bootstrapper.py
see https://www.freecodecamp.org/news/auto-format-your-python-code-with-black/ for detail.
- Remember to define module at start of python file. Missing docstring error.
see more detail here https://stackoverflow.com/questions/46192576/how-can-i-fix-flake8-d100-missing-docstring-error-in-atom-editor