control-service: job-builder using kaniko fix (#2429)
We currently prepare our job-builder image like this:

```
FROM gcr.io/kaniko-project/executor

FROM alpine

COPY --from=0 /kaniko /kaniko
```

This is apparently a known issue, and it caused the outage described in #2391:
 
https://github.com/GoogleContainerTools/kaniko#known-issues

> Running kaniko in any Docker image other than the official kaniko
image is not supported (ie YMMV).
> This includes copying the kaniko executables from the official image
into another image.

Building directly from the official kaniko image instead should fix
the issue.

See #2391

Testing Done: Beyond the automated tests, I also tested it in one of the
environments where the issue in #2391 reproduces and verified that jobs
are built correctly with the new image.

---------

Co-authored-by: github-actions <>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
antoniivanov and pre-commit-ci[bot] authored Jul 22, 2023
1 parent fcb48da commit 2946b23
Showing 12 changed files with 280 additions and 86 deletions.
23 changes: 1 addition & 22 deletions projects/control-service/projects/job-builder/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,32 +1,11 @@
# Used to trigger a build for a data job image.

FROM gcr.io/kaniko-project/executor

FROM alpine

COPY --from=0 /kaniko /kaniko


ENV PATH $PATH:/kaniko
ENV SSL_CERT_DIR=/kaniko/ssl/certs
ENV DOCKER_CONFIG /kaniko/.docker/
FROM gcr.io/kaniko-project/executor:debug

WORKDIR /workspace

COPY Dockerfile.python.vdk /workspace/Dockerfile
COPY build_image.sh /build_image.sh
RUN chmod +x /build_image.sh


# Setup Python and Git
## Update & Install dependencies
RUN apk add --no-cache --update \
git \
bash

RUN apk add --no-cache --repository http://dl-cdn.alpinelinux.org/alpine/v3.10/main python3=3.7.10-r0 py3-pip \
&& pip3 install awscli \
&& apk --purge -v del py3-pip \
&& rm -rf /var/cache/apk/*

ENTRYPOINT ["/build_image.sh"]
29 changes: 28 additions & 1 deletion projects/control-service/projects/job-builder/README.md
@@ -1,2 +1,29 @@
# Job Builder
This package provides a way to configure and build your own Data Job images.

Job Builder is a component that helps the Control Service build data job images.
It forms an essential part of the setup and installation of the Control Service and is used by Control Service Deployment APIs.

See [this activity diagram for reference](https://www.plantuml.com/plantuml/svg/bP9FJzj04CNl_XHFzD0WjLyW7CgVeAeKgGWgSTxOOpDbFUFExaAjgj-ztcnZGGX8lIIhdNc_UM_Mno4wYwdtLRXd6Pov7YTrv0UE8tvN073gwllED4bpfbuDtvFzJCg1IbMj8IkLKp-rLd-gFQmLkrwbUGLvoTrTFFNfBMJsMLNLyYOVi4xi6vOEZOiEFtGDxbr7HnMtMAoK0XnMkNInBO5-SOXerH3lE6JD-u07ii0gdmwdIn8iHWg76nFlhfodpqOao_CipBCAfys-Fo147J2OrXJ2qKQIRohoWR0GBPJbcPEQFEWVelWcAuvRS9myM1APQWMI_NE0KJSfR4GS1yB9xGqEpi-k3vxvH1QKAKOk4gOEr2hHiP31QD30KIT82KtpiigevrP96cwBwKkNfBw3Wz1ZSGmLV4rhCg580IbCVZCnmpvkCvKNm1pZreLj3mebf3glgqt-vS9tbdunYshj1q-HdihzUBHFjABScCbF5rrQvnTw4RrG1fRx9Quf6jC3mMiNeEthB6uNNqe-CbD3xLAW1kjnSvVjVtiKih0dwSxC6v86ef5RhbrabGchEvHvxawGdJ3_HR_oBhPgFKwQdkNj4VF7KKxbztZwIxt_2m00)

The Data Job build process executed by the job-builder image goes through these steps:
1. It uses Git to fetch the data job source code for the given version (git commit).
- Each top-level folder in the Git repo represents a single data job.
2. It installs the required Python dependencies of the job, found in requirements.txt in the job's root directory.
3. It uses Kaniko to build the Data Job Docker image and push it to a container registry.

Upon failure, the Control Service inspects the logs for errors. It looks specifically for:
- logs containing `>requirements_failed<`, indicating that a Python requirement of the job failed to install
- logs containing `>data-job-not-found<` or `failed to get files used from context`, indicating that the user is trying to deploy a job that no longer exists

In both cases it reports a user error and sends a notification to the job owners (if any are configured).

In all other cases it sends notifications to the Control Service administrators.

## Installation

To install the job-builder image, configure it in the Control Service; in production this is usually done through the Helm chart.

In the Helm chart, the `deploymentBuilderImage` configuration options control which builder image is used.
[More details here](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/values.yaml#L51)

For local debugging, you can run it with `docker run`.
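
The failure handling described in the README text above can be sketched as a small log classifier. This is only an illustration under stated assumptions: the class name `BuildFailureClassifier` and the `Audience` enum are hypothetical and are not part of the Control Service; only the three marker strings are taken from the README.

```java
import java.util.List;

/** Hypothetical sketch: decide who is notified based on job-builder logs. */
final class BuildFailureClassifier {

  /** Who should be notified about a failed build (illustrative names). */
  enum Audience {
    JOB_OWNERS,
    ADMINISTRATORS
  }

  // Markers quoted from the README; each one is treated as a user error.
  private static final List<String> USER_ERROR_MARKERS =
      List.of(
          ">requirements_failed<",
          ">data-job-not-found<",
          "failed to get files used from context");

  /** Returns JOB_OWNERS when the logs contain a user-error marker, else ADMINISTRATORS. */
  static Audience classify(String logs) {
    if (logs != null) {
      for (String marker : USER_ERROR_MARKERS) {
        if (logs.contains(marker)) {
          return Audience.JOB_OWNERS;
        }
      }
    }
    // Missing logs or no known marker: treat the failure as a platform error.
    return Audience.ADMINISTRATORS;
  }

  public static void main(String[] args) {
    System.out.println(classify("pip install failed >requirements_failed<")); // JOB_OWNERS
    System.out.println(classify("error building image: unexpected EOF")); // ADMINISTRATORS
  }
}
```

Keeping the markers in one list makes it easy to see that a new kaniko error string, such as `failed to get files used from context`, only needs to be added in one place.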
51 changes: 18 additions & 33 deletions projects/control-service/projects/job-builder/build_image.sh
@@ -6,8 +6,8 @@ aws_access_key_id=$1
aws_secret_access_key=$2
aws_region=$3
docker_registry=$4
git_username=$5
git_password=$6
export GIT_USERNAME="$5"
export GIT_PASSWORD="$6"
git_repository=$7
registry_type=$8
registry_username=$9
@@ -23,28 +23,20 @@ echo "AWS_REGION=$aws_region"
echo "DOCKER_REGISTRY=$docker_registry"
echo "GIT_REPOSITORY=$git_repository"
echo "REGISTRY_TYPE=$registry_type"
# We default to generic repo.
# We have special support for ECR because
# even though Kaniko supports building and pushing images to ECR
# it doesn't create repository nor do they think they should support it -
# https://github.com/GoogleContainerTools/kaniko/pull/1537
# And ECR requires for each image to create separate repository
# And ECR will not create new image repository on docker push
# So we need to do it manually.

GIT_BRANCH=${GIT_BRANCH:-"master"}

# The ECR repository must have been created before calling this script
# Or the image push will fail
if [ "$registry_type" = "ecr" ] || [ "$registry_type" = "ECR" ] ; then
# Setup credentials to connect to AWS - same creds will be used by kaniko as well.
aws configure set aws_access_key_id $aws_access_key_id
aws configure set aws_secret_access_key $aws_secret_access_key
export AWS_ACCESS_KEY_ID="$aws_access_key_id"
export AWS_SECRET_ACCESS_KEY="$aws_secret_access_key"

# Check if aws_session_token is set and not empty.
if [ -n "$aws_session_token" ] ; then
aws configure set aws_session_token "$aws_session_token"
export AWS_SESSION_TOKEN="$aws_session_token"
fi
# https://stackoverflow.com/questions/1199613/extract-filename-and-path-from-url-in-bash-script
repository_prefix=${docker_registry#*/}
# Create docker repository if it does not exist
aws ecr describe-repositories --region $aws_region --repository-names $repository_prefix/${DATA_JOB_NAME} ||
aws ecr create-repository --region $aws_region --repository-name $repository_prefix/${DATA_JOB_NAME}
echo '{ "credsStore": "ecr-login" }' > /kaniko/.docker/config.json
elif [ "$registry_type" = "generic" ] || [ "$registry_type" = "GENERIC" ]; then
export auth=$(echo -n $registry_username:$registry_password | base64 -w 0)
@@ -62,23 +54,16 @@ cat > /kaniko/.docker/config.json <<- EOM
EOM
#cat /kaniko/.docker/config.json
fi
# Clone repo into /data-jobs dir to get job's source
git_url_scheme="https"
[ "$GIT_SSL_ENABLED" = false ] && git_url_scheme="http"
git clone $git_url_scheme://$git_username:$git_password@$git_repository ./data-jobs
cd ./data-jobs
git reset --hard $GIT_COMMIT || ( echo ">data-job-not-found<" && exit 1 )
if [ ! -d ${DATA_JOB_NAME} ]; then
echo ">data-job-not-found<"
exit 1
fi
cd ..
# kaniko supports building directly from git repository but as we've cloned it here anyhow
# (to check if job directory exists) there's no need to.
/kaniko/executor \

export GIT_URL="git://$git_repository#refs/heads/$GIT_BRANCH#$GIT_COMMIT"
echo "GIT_URL is $GIT_URL"

/kaniko/executor --log-timestamp=true --single-snapshot \
--dockerfile=/workspace/Dockerfile \
--destination="${IMAGE_REGISTRY_PATH}/${DATA_JOB_NAME}:${GIT_COMMIT}" \
--build-arg=job_githash="$JOB_GITHASH" \
--build-arg=base_image="$BASE_IMAGE" \
--build-arg=job_name="$JOB_NAME" \
--context=./data-jobs $EXTRA_ARGUMENTS
--context="$GIT_URL" $EXTRA_ARGUMENTS

# Hint: nice EXTRA_ARGUMENTS for kaniko when debugging are --verbosity=debug or --verbosity=trace
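
The script now hands kaniko a git build context instead of a locally cloned directory. The shape of that context string can be illustrated with a short sketch. It is written in Java only to make the format easy to test; the real composition happens in the shell script above. The class name `KanikoGitContext` and the example repository and commit are hypothetical.

```java
/** Hypothetical sketch mirroring how build_image.sh composes GIT_URL. */
final class KanikoGitContext {

  /** Same default as GIT_BRANCH=${GIT_BRANCH:-"master"} in the script. */
  static final String DEFAULT_BRANCH = "master";

  /**
   * Builds git://<repository>#refs/heads/<branch>#<commit>.
   * A null or empty branch falls back to the default, like ${VAR:-default} in shell.
   */
  static String build(String gitRepository, String gitBranch, String gitCommit) {
    String branch = (gitBranch == null || gitBranch.isEmpty()) ? DEFAULT_BRANCH : gitBranch;
    return "git://" + gitRepository + "#refs/heads/" + branch + "#" + gitCommit;
  }

  public static void main(String[] args) {
    // Example values are made up for illustration.
    System.out.println(build("github.com/example-org/data-jobs.git", null, "4f2a9c1"));
    // prints git://github.com/example-org/data-jobs.git#refs/heads/master#4f2a9c1
  }
}
```

Note that the credentials no longer appear in the URL: the script exports `GIT_USERNAME` and `GIT_PASSWORD` as environment variables instead of embedding them in a clone URL.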
2 changes: 1 addition & 1 deletion projects/control-service/projects/job-builder/version.txt
@@ -1 +1 @@
1.3.4
1.4.0
@@ -54,7 +54,10 @@ public enum MainExternalSystem {
GIT("Git"),

/** A secrets storage/repository used to store data jobs secrets. */
SECRETS("Secrets");
SECRETS("Secrets"),

/** A container registry (like dockerhub or ECR) used to store container images. */
CONTAINER_REGISTRY("Container Registry");

private final String userFacingName;

@@ -54,6 +54,13 @@ public void verifyBuilderResult(
userErrorMessage,
sendNotification);
} else {
if (logs.contains("error resolving source context: reference not found")) {
log.error(
"Job Builder image failed to clone the git repository: "
+ "double check the git configuration including git url, credentials and branch."
+ " If this is not the root cause, see the next error message.");
}

ErrorMessage message =
new ErrorMessage(
String.format(
@@ -98,7 +105,9 @@ private String getUserErrorMessage(String logs, JobDeployment jobDeployment) thr
private String getDataJobNotFoundError(String logs, JobDeployment jobDeployment) {
String error = null;

if (StringUtils.isNotBlank(logs) && logs.contains(">data-job-not-found<")) {
if (StringUtils.isNotBlank(logs)
&& (logs.contains(">data-job-not-found<")
|| logs.contains("failed to get files used from context"))) {
error =
NotificationContent.getErrorBody(
"Tried to deploy a data job and failed.",
@@ -45,6 +45,33 @@ public static String getTag(String imageName) {
return tag != null ? tag : "latest";
}

/**
* Extracts the image path from a given container URL.
*
* <p>The method takes a container URL string in the format [hostName]/[imagePath]:[tag] and
* returns the image path. This includes the namespace and image name but excludes the host and
* the tag. If no tag is present in the URL, the method will return the full image path.
*
* @param containerURL A string representing the container URL. Should be in the format
* [hostName]/[imageName]:[tag].
* @return A string representing the image path (i.e., namespace and image name) from the
* container URL.
*/
public static String getImagePath(String containerURL) {
String imageName = containerURL;

var firstSlash = containerURL.indexOf('/');
if (firstSlash >= 0) {
imageName = containerURL.substring(firstSlash + 1, containerURL.length());
}

var lastColon = imageName.lastIndexOf(':');
if (lastColon >= 0) {
imageName = imageName.substring(0, lastColon);
}
return imageName;
}

/**
* Updates the image name with the new tag
*
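
To make the behaviour of `getImagePath` in the hunk above concrete, here is a small standalone sketch that copies its logic and runs it on a few inputs. The wrapper class `ImagePathDemo` and the sample URLs are assumptions made for illustration; the method body follows the diff.

```java
/** Standalone copy of the getImagePath logic, for illustration only. */
final class ImagePathDemo {

  static String getImagePath(String containerURL) {
    String imageName = containerURL;

    // Drop everything up to and including the first '/', assumed to be the host.
    int firstSlash = containerURL.indexOf('/');
    if (firstSlash >= 0) {
      imageName = containerURL.substring(firstSlash + 1);
    }

    // Drop the tag, if any, by cutting at the last ':' of the remainder.
    int lastColon = imageName.lastIndexOf(':');
    if (lastColon >= 0) {
      imageName = imageName.substring(0, lastColon);
    }
    return imageName;
  }

  public static void main(String[] args) {
    System.out.println(getImagePath("registry.example.com/my-ns/my-job:abc123")); // my-ns/my-job
    System.out.println(getImagePath("registry.example.com/my-ns/my-job")); // my-ns/my-job
    System.out.println(getImagePath("localhost:5000/my-job:1.0")); // my-job
    System.out.println(getImagePath("my-ns/my-job:abc123")); // my-job
  }
}
```

The last call shows a limitation worth keeping in mind: the method always treats the first path segment as the host, so a URL without a host loses its namespace.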
@@ -5,19 +5,18 @@

package com.vmware.taurus.service.deploy;

import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.services.ecr.AmazonECR;
import com.amazonaws.services.ecr.AmazonECRClientBuilder;
import com.amazonaws.services.ecr.model.DescribeImagesRequest;
import com.amazonaws.services.ecr.model.DescribeImagesResult;
import com.amazonaws.services.ecr.model.ImageIdentifier;
import com.amazonaws.services.ecr.model.ImageNotFoundException;
import com.amazonaws.services.ecr.model.RepositoryNotFoundException;
import com.amazonaws.services.ecr.model.*;
import com.vmware.taurus.exception.ExternalSystemError;
import com.vmware.taurus.service.credentials.AWSCredentialsService;
import com.vmware.taurus.service.credentials.AWSCredentialsService.AWSCredentialsDTO;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.springframework.stereotype.Service;

/**
@@ -30,7 +29,7 @@ public class EcrRegistryInterface {

public AWSStaticCredentialsProvider createStaticCredentialsProvider(
AWSCredentialsDTO awsCredentialsDTO) {
if (!awsCredentialsDTO.awsSessionToken().isBlank()) {
if (!StringUtils.isBlank(awsCredentialsDTO.awsSessionToken())) {
// need to include session token
return new AWSStaticCredentialsProvider(
new BasicSessionCredentials(
@@ -47,7 +46,7 @@ public AWSStaticCredentialsProvider createStaticCredentialsProvider(

public String extractImageRepositoryTag(String imageName) {
// imageName is a string of the sort:
// 850879199482.dkr.ecr.us-west-2.amazonaws.com/sc/dp/job-name:hash
// 850879199482.dkr.ecr.us-west-2.amazonaws.com/sc/dp/job-name:hash'
return imageName.split("amazonaws.com/")[1];
}

@@ -74,6 +73,27 @@ private DescribeImagesRequest buildDescribeImagesRequest(String imageName) {
.withImageIds(imageIdentifier);
}

private static boolean existsRepository(AmazonECR ecrClient, String repositoryName) {
try {
ecrClient.describeRepositories(
new DescribeRepositoriesRequest().withRepositoryNames(repositoryName));
return true;
} catch (RepositoryNotFoundException e) {
log.debug("Repository does not exist: {}", repositoryName);
return false;
} catch (Exception e) {
log.warn("Failed to check if image exists and will assume it doesn't exist. Exception: " + e);
return false;
}
}

/**
* Checks if a specific image exists in the Amazon ECR.
*
* @param imageName the name of the image whose existence is to be checked
* @param awsCredentialsDTO the DTO containing AWS credentials information
* @return true if the specified image exists, false otherwise
*/
public boolean checkEcrImageExists(
String imageName, AWSCredentialsService.AWSCredentialsDTO awsCredentialsDTO) {

@@ -86,10 +106,48 @@ public boolean checkEcrImageExists(
imageExists = true;
}
} catch (ImageNotFoundException | RepositoryNotFoundException e) {
log.info("Could not find image due to: {}", e);
log.info("Could not find image due to " + e);
} catch (Exception e) {
log.error("Failed to check if image exists due to: ", e);
log.warn("Failed to check if image exists and will assume it doesn't exist. Exception: " + e);
}
return imageExists;
}

/**
* Creates a repository in Amazon ECR with the provided repository name. If a repository with the
* same name already exists, then nothing happens and operation succeeds.
*
* @param repositoryName the name of the repository to be created. This is without the registry
* part of URI: e.g. if full URL is
* aws_account_id.dkr.ecr.us-west-2.amazonaws.com/my-ns/my-repository:tag , the repository
* name is "my-ns/my-repository"
* @param awsCredentialsDTO the DTO containing AWS credentials information
* @throws ExternalSystemError if other exception occurs during repository creation with container
* registry
*/
public void createRepository(
String repositoryName, AWSCredentialsService.AWSCredentialsDTO awsCredentialsDTO) {
AmazonECR ecrClient = buildAmazonEcrClient(awsCredentialsDTO);

try {

if (!existsRepository(ecrClient, repositoryName)) {
log.debug("Create ECR repository {}", repositoryName);
CreateRepositoryRequest createRepositoryRequest =
new CreateRepositoryRequest().withRepositoryName(repositoryName);

CreateRepositoryResult createRepositoryResult =
ecrClient.createRepository(createRepositoryRequest);

String repositoryUri = createRepositoryResult.getRepository().getRepositoryUri();
log.debug("ECR repository created: {}", repositoryUri);
}

} catch (AmazonClientException e) {
throw new ExternalSystemError(
ExternalSystemError.MainExternalSystem.CONTAINER_REGISTRY,
"Creating container repository " + repositoryName + " failed.",
e);
}
}
}
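
The check-then-create flow of `createRepository` above can be illustrated without the AWS SDK. The sketch below is not the ECR client: the `Registry` interface and `InMemoryRegistry` class are hypothetical stand-ins used only to show that calling the operation twice creates the repository once.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of an idempotent "create repository if missing" operation. */
final class RepositoryCreationSketch {

  /** Stand-in for a container registry client; not part of any real SDK. */
  interface Registry {
    boolean exists(String repositoryName);

    void create(String repositoryName);
  }

  /** In-memory registry that fails on duplicate creation, to make mistakes visible. */
  static final class InMemoryRegistry implements Registry {
    private final Set<String> repositories = new HashSet<>();
    int createCalls = 0;

    @Override
    public boolean exists(String repositoryName) {
      return repositories.contains(repositoryName);
    }

    @Override
    public void create(String repositoryName) {
      createCalls++;
      if (!repositories.add(repositoryName)) {
        throw new IllegalStateException("Repository already exists: " + repositoryName);
      }
    }
  }

  /** Creates the repository only when the existence check reports it missing. */
  static void createIfMissing(Registry registry, String repositoryName) {
    if (!registry.exists(repositoryName)) {
      registry.create(repositoryName);
    }
  }

  public static void main(String[] args) {
    InMemoryRegistry registry = new InMemoryRegistry();
    createIfMissing(registry, "my-ns/my-repository");
    createIfMissing(registry, "my-ns/my-repository"); // second call is a no-op
    System.out.println(registry.createCalls); // 1
  }
}
```

The check and the create are two separate calls, so two concurrent callers could both see the repository as missing. The sketch does not address that case.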