Newest Data Job Base Images are Broken #2391

Closed
doks5 opened this issue Jul 11, 2023 · 2 comments · Fixed by #2429

@doks5
Contributor

doks5 commented Jul 11, 2023

Describe the bug
The latest data job base images do not work properly due to SSL errors caused by the underlying OS packages. Building data job images succeeds, but job executions subsequently fail with User Errors similar to the one below:

"An exception occurred, exception message was: An error in data job code  occurred. The error should be resolved by User. Here are the details:\n  WHAT HAPPENED : Failed loading job sources of 10_python_step.py\nWHY IT HAPPENED : An exception occurred, exception message was: No module named '<some-package-name>'\n   CONSEQUENCES : The provided Data Job will not be executed. Terminating application.\nCOUNTERMEASURES : See contents of the exception and fix the problem that causes it. Most likely importing a dependency or data job step failed, see logs for details and fix the failed step (details in stacktrace).",

There are some errors in the job builder logs, but these do not seem to stop the process of deployment, and job images are built and pushed successfully to the registry. Sample error logs:

Couldn't eval /usr/lib/libcrypto.so.3 with link /usr/lib/libcrypto.so.3

Couldn't eval /usr/lib/libssl.so.3 with link /usr/lib/libssl.so.3

Couldn't eval /usr/lib/libssl.so.1.1 with link /usr/lib/libssl.so.1.1

Couldn't eval /usr/lib/libcrypto.so.1.1 with link /usr/lib/libcrypto.so.1.1
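
Since the build itself does not fail, one way to notice the problem early might be to scan the builder logs for these lines. Below is a minimal sketch, assuming the log text is already available as a string; the function name `find_eval_errors` is hypothetical and the pattern is derived only from the sample lines above.

```python
import re

# Pattern derived from the sample log lines in this issue (an assumption:
# other kaniko versions may word the message differently).
EVAL_ERROR = re.compile(r"Couldn't eval (?P<path>\S+) with link (?P<link>\S+)")


def find_eval_errors(log_text: str) -> list:
    """Return the library paths named in "Couldn't eval" log lines."""
    return [match.group("path") for match in EVAL_ERROR.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "Couldn't eval /usr/lib/libcrypto.so.3 with link /usr/lib/libcrypto.so.3\n"
        "Couldn't eval /usr/lib/libssl.so.3 with link /usr/lib/libssl.so.3\n"
    )
    for path in find_eval_errors(sample):
        print("suspicious builder log entry for", path)
```

A non-empty result could be treated as a warning that the built job image may be missing its dependencies, even though the build reported success.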

Steps To Reproduce
Steps to reproduce the behavior:

  1. In the values.yaml of the Control Service, set the data job base image to, for example, data-job-base-python-3.9:latest.
  2. Deploy an instance of the Control Service so that data jobs can be deployed in a k8s cluster.
  3. Deploy a test data job.
  4. Monitor the job builder pods for error logs.
  5. After the job is deployed, execute it.
  6. Observe the User Error caused by the missing module ("No module named ...").

Expected behavior
The data job executes successfully without errors.

Version (please complete the following information):

  • OS: Alpine/Debian (job builder OS images)
  • Version: not dependent on the VDK version

Additional context
We managed to work around the error by using version 1.832469684 of the job base images (e.g., data-job-base-python-3.9:1.832469684).

One observation we made is that the newest job base images use Python 3.9.17, which was released on July 6: https://hub.docker.com/layers/library/python/3.9.17/images/sha256-46d99870b9c25e64c5583a59fec411df04191ab6b14cf81a72b9e4a20a78b659?context=explore

@doks5 doks5 added the bug Something isn't working label Jul 11, 2023
@antoniivanov
Collaborator

Looking at the differences between the two Python base images, the only changes are in PYTHON_VERSION and PYTHON_PIP_VERSION.
OLD: data-job-base-python-3.11/1.832469684
NEW: data-job-base-python-3.11/1.927053752

PYTHON_VERSION=3.11.4 vs PYTHON_VERSION=3.11.3
PYTHON_PIP_VERSION=23.1.2 vs PYTHON_PIP_VERSION=22.3.1
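
For anyone repeating this comparison, here is a small sketch of diffing the environment variables of two images once they have been collected into dictionaries. The helper `diff_env` is hypothetical, the `LANG` entry is only illustrative, and the sketch assumes the higher versions listed above belong to the NEW image.

```python
def diff_env(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for variables that differ."""
    names = sorted(set(old) | set(new))
    return {
        name: (old.get(name), new.get(name))
        for name in names
        if old.get(name) != new.get(name)
    }


# Values taken from this comment; which image carries which version is an
# assumption, and LANG is an illustrative unchanged variable.
old_env = {"PYTHON_VERSION": "3.11.3", "PYTHON_PIP_VERSION": "22.3.1", "LANG": "C.UTF-8"}
new_env = {"PYTHON_VERSION": "3.11.4", "PYTHON_PIP_VERSION": "23.1.2", "LANG": "C.UTF-8"}

if __name__ == "__main__":
    for name, (before, after) in diff_env(old_env, new_env).items():
        print(f"{name}: {before} -> {after}")
```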

@antoniivanov antoniivanov self-assigned this Jul 12, 2023
@sabadzhiev sabadzhiev added the ready The tickets that are ready for the team to take and work on. label Jul 13, 2023
@antoniivanov
Collaborator

Similar issues: GoogleContainerTools/kaniko#1395 and GoogleContainerTools/kaniko#1045

As far as I can see, this issue is caused by using a separate image and copying the kaniko executor into it.

The way we are preparing our job-builder image is like this:

FROM gcr.io/kaniko-project/executor

FROM alpine

COPY --from=0 /kaniko /kaniko

This is apparently a known issue: https://github.com/GoogleContainerTools/kaniko#known-issues

Running kaniko in any Docker image other than the official kaniko image is not supported (ie YMMV).
This includes copying the kaniko executables from the official image into another image.

So building our job-builder image directly from the kaniko image should fix the issue.

antoniivanov added a commit that referenced this issue Jul 19, 2023
The way we are preparing our job-builder image is like this:

```
FROM gcr.io/kaniko-project/executor

FROM alpine

COPY --from=0 /kaniko /kaniko
```

This is apparently a known issue:
https://github.com/GoogleContainerTools/kaniko#known-issues

> Running kaniko in any Docker image other than the official kaniko
image is not supported (ie YMMV).
> This includes copying the kaniko executables from the official image
into another image.

So building our job-builder image directly from the kaniko image should fix
the issue.

See #2391

Google Java Format

control-service: job-builder using kaniko fix
@antoniivanov antoniivanov linked a pull request Jul 20, 2023 that will close this issue
antoniivanov added a commit that referenced this issue Jul 22, 2023
The way we are preparing our job-builder image is like this:

```
FROM gcr.io/kaniko-project/executor

FROM alpine

COPY --from=0 /kaniko /kaniko
```

This is apparently a known issue, and it caused the outage described in #2391:

https://github.com/GoogleContainerTools/kaniko#known-issues

> Running kaniko in any Docker image other than the official kaniko
image is not supported (ie YMMV).
> This includes copying the kaniko executables from the official image
into another image.

So building our job-builder image directly from the kaniko image should fix
the issue.

See #2391

Testing Done: Beyond the automated tests, I also tested it in one of the
environments where the issue in #2391 reproduces and verified that, with the
new image, the jobs are built correctly.

---------

Co-authored-by: github-actions <>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
antoniivanov added a commit that referenced this issue Jul 25, 2023
Upon `vdk deploy`, the Control Service is responsible for installing all
external dependencies defined in the requirements.txt of the data job so they
can be used during a cloud run.

A recent outage at one of our customers caused data jobs to ignore their
requirements.txt and not install any external libraries. It was traced to
a release of the [data job base image](https://hub.docker.com/layers/versatiledatakit/data-job-base-python-3.11).

The CI/CD of the open source VDK did not catch this.

I am extending VDK heartbeat to test for the issue so we can catch it in
the future.
 
See also #2391
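
As a rough illustration of the kind of check such a test could perform (this is not the actual heartbeat implementation), the sketch below reads requirement lines and reports which packages are not installed in the running environment. The function names are hypothetical, and the name parsing is deliberately simplistic: it only takes the leading package name and skips comments and option lines.

```python
import re
from importlib import metadata

# Leading distribution name of a requirement line, e.g. "requests" in
# "requests==2.31.0". Simplified on purpose; not a full requirements parser.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def _installed(name: str) -> bool:
    """Check whether a distribution is installed in the current environment."""
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_requirements(lines, is_installed=_installed):
    """Return the names of required packages that are not installed."""
    missing = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        match = _NAME.match(line)  # option lines such as "-r x.txt" do not match
        if match and not is_installed(match.group(0)):
            missing.append(match.group(0))
    return missing


if __name__ == "__main__":
    with open("requirements.txt") as requirements:
        print(missing_requirements(requirements))
```

A data job step could call `missing_requirements` at the start of a run and raise an error when the result is non-empty, which would surface the problem described here as a clear failure instead of a later "No module named" error.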