Skip to content

Latest commit

 

History

History
218 lines (154 loc) · 9.4 KB

DEVELOPMENT.md

File metadata and controls

218 lines (154 loc) · 9.4 KB

Development

This document describes how to set up a development environment for developing, building, and testing the SageMaker Spark Container image.

Development Environment Setup

You’ll need to have python, pytest, docker, and docker-compose installed on your machine and on your $PATH.

This repository uses GNU make to run build targets specified in Makefile. Consult the Makefile for the full list of build targets.

Pulling Down the Code

  1. If you do not already have one, create a GitHub account by following the prompts at Join Github.
  2. Create a fork of this repository on GitHub. You should end up with a fork at https://github.com/<username>/sagemaker-spark-container.
    1. Follow the instructions at Fork a Repo to fork a GitHub repository.
  3. Clone your fork of the repository: git clone https://github.com/<username>/sagemaker-spark-container where <username> is your github username.

Setting Up The Development Environment

  1. To set up your python environment, we recommend creating and activating a virtual environment using venv, which is part of the Python standard library for python 3+:
python -m venv .venv
source .venv/bin/activate

You may want to activate the Python environment in your .bashrc or .zshrc.

  1. Then install pytest into the virtual environment:

python -m pip install pytest pytest-parallel

  1. Ensure docker is installed (see: Get Docker | Docker Documentation)

docker will be used to build and test the Spark container locally

  1. Ensure you have access to an AWS account i.e. setup your environment such that awscli can access your account via either an IAM user or an IAM role. We recommend an IAM role for use with AWS. For the purposes of testing in your personal account, the following managed permissions should suffice:

-- AmazonSageMakerFullAccess
-- AmazonS3FullAccess
-- AmazonEC2ContainerRegistryFullAccess

  1. Create an ECR repository with the name "sagemaker-spark" in the us-west-2 region

  2. Setup required environment variables for the container build:

# The AWS account id where the spark image is hosted in ECR
export AWS_ACCOUNT_ID=<YOUR_AWS_ACCOUNT_ID>

# AWS region 
export REGION=us-west-2

# The AWS account id which will be used as in integration test acount, can be the same as AWS_ACCOUNT_ID above               
export INTEG_TEST_ACCOUNT_ID=<YOUR_INTEGT_TEST_AWS_ACCOUNT_ID>

# Name of the ECR repository where spark image will be hosted 
export SPARK_REPOSITORY=sagemaker-spark-processing

# TAG of the spark image
export VERSION=latest

# AWS IAM Role to be used for running the integration tests
export SAGEMAKER_ROLE=<YOUR_INTEG_TEST_SAGEMAKER_ROLE>

# Domain for AWS, for CN regions it is `amazonaws.com.cn` for other regions it is `amazonaws.com` 
export AWS_DOMAIN=amazonaws.com

# Partition for AWS, for CN regions it is `aws-cn` for other regions it is `aws`
export AWS_PARTITION=aws

Building Scala Test Dependencies

Compiling Scala test JARs requires SBT installed on your system.

Mac users can easily install using Homebrew: brew install sbt

For more info see https://www.scala-sbt.org/1.x/docs/Setup.html

To compile the Scala test JAR: make build-test-scala

Building Java Test Dependencies

Compiling Java test JARs requires Maven installed on your system.

Mac users can easily install using Homebrew: brew install maven

For more info see https://maven.apache.org/install.html

To compile the Java test JAR: make build-test-java

Building Your Image

  1. To build the container image, run the following command:
make build

Upon successful build, you will see two tags applied to the image. For example:

Successfully tagged sagemaker-spark:2.4-cpu-py37-v0.1
Successfully tagged sagemaker-spark:latest
  1. To verify that the image is available in your local docker repository, run docker images. You should see an image with two tags. For example:
✗ docker images
REPOSITORY                                                                             TAG                   IMAGE ID            CREATED             SIZE
sagemaker-spark                                                                        2.4-cpu-py37-v0.1     a748a6e042d2        5 minutes ago        3.06GB
sagemaker-spark                                                                        latest                a748a6e042d2        5 minutes ago        3.06GB

Running Local Tests

To run local tests (unit tests and local container tests using docker compose), run the following command:

make test-local

Running SageMaker Tests

To run tests against Amazon SageMaker using your newly built container image, first publish the image to your ECR repository.

  1. Bootstrap docker credentials for your repository
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com
  1. Tag the latest Spark image
docker tag ${SPARK_REPOSITORY}:latest $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/$SPARK_REPOSITORY:$VERSION
  1. Push the latest Spark image to your ECR repository
docker push $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/$SPARK_REPOSITORY:$VERSION
  1. Run the SageMaker tests
make test-sagemaker
  1. please run following command before you raise CR:
make test-unit
make install-container-library

Push the code

  1. You need to create PR request in order to merge the code. How to create PR request lists here:https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
  2. You need to get Github access of AWS organization. please following here:https://w.amazon.com/?Open_Source/GitHub
  3. Get access to permission specific to a team, example is here:https://github.com/orgs/aws/teams/sagemakerwrite/members
  4. Ask a person to review the code and merge it in.This repo needs at least one code reviewer.
  5. The code needs to be signed before pushing. More detail about signing is here:https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits.
$ git commit -S -m "your commit message"

6.Remember in your local, you need to set up: git config --global user.signingkey [key id] and also upload public key into your github account. The email you specify when you created public key must match github email in github settings.

FAQ

  1. ERROR: smspark-0.1-py2-none-any.whl is not a supported wheel on this platform.
  • This is because you are switching python version, make sure you are using python 3.* so that sage maker will generate right whl file. The right name is smspark-0.1-py3-none-any.whl. If you find the wrong file, then you need to delete it and then compile will succeed.
  1. Error: TypeError: Descriptors cannot not be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
  • Sometimes the protobuf package might be installed without your involvement. For this, you have two solutions to apply. Try one of the below solutions and it should work. pip install protobuf==3.20.* Or export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, see more detail in this link
  1. Running test code build is time consuming so before start a new job you can do the following things locally
  • make test-unit
  • make install-container-library
  1. Fetching EC2 instance type info to ./spark/processing/3.1/py3/aws-config/ec2-instance-type-info.json ... ./scripts/fetch-ec2-instance-type-info.sh: line 32: jq: command not found
  • you need to install corresponding packages.
| REPORT                                                                       |
| checked 94 packages, using free DB (updated once a month)                    |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| waitress                   | 2.1.1     | >=2.1.0,<2.1.2           | 49257    |
+==============================================================================+
make: *** [install-container-library] Error 255
  • you need to update smsparkbuild/py39/Pipfile corresponding package version.
  1. Code build may fail because of the format, for example
2 files would be reformatted, 13 files would be left unchanged.

you can fix it by running

black src/smspark/bootstrapper.py

see https://www.freecodecamp.org/news/auto-format-your-python-code-with-black/ for detail.

  1. Remember to define module at start of python file. Missing docstring error.

see more detail here https://stackoverflow.com/questions/46192576/how-can-i-fix-flake8-d100-missing-docstring-error-in-atom-editor