core: synth of CustomResourceProvider hangs in Docker on Linux 5.6-5.10 #21379

Closed
nburtsev opened this issue Jul 29, 2022 · 5 comments · Fixed by #23076
Labels
@aws-cdk/core Related to core CDK functionality bug This issue is a bug. effort/large Large work item – several weeks of effort p2

Comments

@nburtsev
Contributor

nburtsev commented Jul 29, 2022

Describe the bug

After updating to 2.34, we noticed that some of our deploys and tests hang in pipelines but work just fine locally.

I was able to narrow it down to our deploying S3 buckets with autoDeleteObjects: true, combined with the change in #20953. The symptoms (endless copy_file_range calls in strace) look very similar to nodejs/node#40200, which in turn points to Docker/kernel bugs.

A simple workaround is to set the TMP env var to a path inside the build working dir (aka the git tree), e.g. mkdir tmp && export TMP=$PWD/tmp && cdk deploy
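The same workaround can be applied from Node.js before any CDK code runs. This is a minimal sketch, not a CDK API: `useWorkingDirTmp` is a hypothetical helper, and it assumes the temporary directory is resolved through `os.tmpdir()`, which on POSIX systems consults `TMPDIR` before `TMP`. For that reason the sketch sets both variables.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper (not part of the CDK): point the process temp dir at a
// directory inside `baseDir`, e.g. the build working directory.
function useWorkingDirTmp(baseDir: string = process.cwd()): string {
  const tmpDir = path.join(baseDir, "tmp");
  fs.mkdirSync(tmpDir, { recursive: true });

  // The workaround from the issue sets TMP. os.tmpdir() checks TMPDIR first
  // on POSIX, so set it too in case the environment already defines it.
  process.env.TMP = tmpDir;
  process.env.TMPDIR = tmpDir;
  return tmpDir;
}

// Usage: call this at the very top of the app entry point.
// useWorkingDirTmp();
// console.log(os.tmpdir());
```

Keeping the directory inside the working tree means it lands on the same mount as the checkout, which is what the shell one-liner achieves.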

Our pipelines are executed in containers inside an OCP 4.10.13 (k8s 1.23.5) cluster in AWS with no special configuration; both node14 and node16 behave the same way. I have limited access to EKS 1.22, and with the same container image the problem does not occur there. That probably makes sense, since the kernel versions are different (4.18.0-305.45.1.el8_4.x86_64 vs 5.4.188-104.359.amzn2.x86_64).

Expected Behavior

Stack is deployed

Current Behavior

cdk deploy hangs in an endless loop

Reproduction Steps

import { Stack, StackProps, RemovalPolicy, App } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";

export class TestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, "Bucket", {
      autoDeleteObjects: true,
      removalPolicy: RemovalPolicy.DESTROY,
    });
  }
}
const app = new App();

new TestStack(app, "test", {});

Any kind of test that renders this stack will also hang.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.34.0 (build 633edab)

Framework Version

No response

Node.js Version

14.19.0

OS

Ubuntu 20.04.4

Language

TypeScript

Language Version

No response

Other information

No response

@nburtsev nburtsev added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 29, 2022
@github-actions github-actions bot added the @aws-cdk/core Related to core CDK functionality label Jul 29, 2022
@rix0rrr
Contributor

rix0rrr commented Sep 2, 2022

Wow, this is horrible. Thanks for reporting. If I'm reading the linked issues correctly, it seems that

  • inside Docker
    • copying a file
      • then copying the copy using fs.copyFile
        • will hang

The linked thread in nodejs/node#40200 seems to not have seen movement in ~a year, and it seems the issue there is punted to a combination of Docker/kernel issues.

I will leave this thread open for discussion and tracking, but I'm not too inclined to make any changes currently. I wouldn't even know how or what to do, properly. Special case detection that we're inside Docker, notice which paths are mapped to volumes (if that's even something we can do), and then choose a different copy target?
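For illustration, the "are we inside Docker" part of such special-case detection could look roughly like the sketch below. It is a heuristic under stated assumptions, not something the CDK is known to do: `looksLikeContainer` is a hypothetical name, and the `/.dockerenv` marker file and the cgroup path contents are common conventions rather than guarantees. It does not answer the harder question of which paths are mapped to volumes.

```typescript
import * as fs from "fs";

// Heuristic only: neither signal is guaranteed by Docker or by the kernel.
// The file paths are parameters so the function can be exercised in tests.
function looksLikeContainer(
  markerFile: string = "/.dockerenv",
  cgroupFile: string = "/proc/1/cgroup",
): boolean {
  // Docker commonly creates this marker file in the container root.
  if (fs.existsSync(markerFile)) {
    return true;
  }
  try {
    // Under cgroup v1 the init process's cgroup paths often name the runtime.
    const cgroups = fs.readFileSync(cgroupFile, "utf8");
    return /docker|kubepods|containerd/.test(cgroups);
  } catch {
    // File missing or unreadable (for example, not running on Linux).
    return false;
  }
}
```

A `false` result here does not prove the process is outside a container, so a check like this could only ever be used to print a hint, not to change behavior reliably.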

If we're in an environment where we can't trust the filesystem anymore... I mean... 🤷‍♂️ I give up.

@rix0rrr rix0rrr added effort/large Large work item – several weeks of effort p2 and removed needs-triage This issue or PR still needs to be triaged. labels Sep 2, 2022
@rix0rrr rix0rrr removed their assignment Sep 2, 2022
@rix0rrr
Contributor

rix0rrr commented Nov 22, 2022

Also, this only seems to happen on very specific instances of Docker on a particular Linux kernel, right?

@rix0rrr rix0rrr changed the title from "core: recent change to CustomResourceProvider makes bundling hang inside certain containers" to "core: synth of CustomResourceProvider hangs in Docker on Linux 5.6-5.10" Nov 22, 2022
@rix0rrr
Contributor

rix0rrr commented Nov 22, 2022

As discussed here, this affects Linux kernels 5.6.x-5.10.x: https://lore.kernel.org/stable/[email protected]/

Workaround is setting $TMP or upgrading your kernel.
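As a rough illustration of how the 5.6.x-5.10.x range could be checked from Node.js, here is a sketch. It is not the code from the eventual fix; `parseKernelRelease`, `isInAffectedRange`, and `warnIfAffected` are hypothetical names, and the warning text is made up for the example.

```typescript
import * as os from "os";

// Parse a release string such as "5.8.0-53-generic" into [major, minor].
// Returns undefined when the string does not start with "<number>.<number>".
function parseKernelRelease(release: string): [number, number] | undefined {
  const match = /^(\d+)\.(\d+)/.exec(release);
  return match ? [Number(match[1]), Number(match[2])] : undefined;
}

// True when the release falls in the 5.6.x-5.10.x range named in the thread.
// Heuristic: distribution kernels may carry backported patches, so the
// version number alone does not settle whether a given kernel is affected.
function isInAffectedRange(release: string): boolean {
  const version = parseKernelRelease(release);
  if (!version) {
    return false;
  }
  const [major, minor] = version;
  return major === 5 && minor >= 6 && minor <= 10;
}

// Print a hint (hypothetical wording) when running on a kernel in the range.
function warnIfAffected(
  platform: string = os.platform(),
  release: string = os.release(),
): boolean {
  const affected = platform === "linux" && isInAffectedRange(release);
  if (affected) {
    console.warn(
      `Kernel ${release} is in the 5.6-5.10 range; if synth hangs, try setting TMP.`,
    );
  }
  return affected;
}
```

Inside a container, `os.release()` reports the host kernel, which is the one that matters here.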

@rix0rrr
Contributor

rix0rrr commented Nov 24, 2022

The CDK behavior is as follows:

  • Setting autoDeleteObjects creates a Custom Resource that will clear the bucket on stack deletion.
  • The CDK copies files when it needs to generate a code bundle for the Custom Resource provider. This code bundle consists of your code plus an index file we add for you.
  • After these source files are generated, the files are then copied into the cdk.out directory as part of asset staging. This is the same for all assets. The directory these files are copied into depends on the hash of all source files going into it, so the source bundle needs to be complete before this step can start.
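To make the last point concrete, here is a sketch of deriving a staging directory name from a hash of the source files. It is an illustration under assumptions, not the CDK's actual fingerprinting code: `fingerprintDirectory` and `stagingDirName` are hypothetical helpers, and the `asset.<hash>` naming is used only as an example.

```typescript
import * as crypto from "crypto";
import * as fs from "fs";
import * as path from "path";

// Hash relative paths and file contents in sorted order so that the result
// is stable across runs. Hypothetical helper for illustration only.
function fingerprintDirectory(dir: string): string {
  const hash = crypto.createHash("sha256");

  const walk = (current: string): void => {
    const entries = fs
      .readdirSync(current, { withFileTypes: true })
      .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
    for (const entry of entries) {
      const full = path.join(current, entry.name);
      // Include the relative path, with a separator, so renames change the hash.
      hash.update(path.relative(dir, full) + "\0");
      if (entry.isDirectory()) {
        walk(full);
      } else {
        hash.update(fs.readFileSync(full));
        hash.update("\0");
      }
    }
  };

  walk(dir);
  return hash.digest("hex");
}

// The target name depends on every source file, which is why the bundle has
// to be fully written before the copy into the output directory can start.
function stagingDirName(sourceDir: string): string {
  return `asset.${fingerprintDirectory(sourceDir)}`;
}
```

Because the name is content-addressed, adding or changing any file in the bundle yields a different target directory.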

The change was:

  • We used to do the first step, copying of source files, inside the node_modules directory. This was actually incorrect, as the node_modules directory should be considered a read-only repository of library code. So we moved the code generation to the system's temporary directory.
  • From Docker's point of view, in the old situation the file used to be created on a volume mount, but in the new situation it is created in a directory that's fully inside the container's overlayfs file system.
  • (This is why the workaround is moving the $TMP dir back to a location inside a Docker volume mount)

The problem was:

  • Because of a combination of Docker and kernel behavior, the second copy operation would appear to copy 0 bytes.
  • The Node.js copyFile function keeps retrying the call to copy the remaining bytes over, gets 0 every time, and waits until the copy is complete. This never finishes, and so the build appears to hang.
  • In later kernel versions, this bug has been fixed so the copy operation returns an actual number of bytes instead of 0, allowing the copy to succeed.
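The failure mode described in these bullets can be simulated without touching the filesystem. The sketch below is a model, not Node.js's actual implementation: `CopyRange` stands in for the underlying copy call, `buggy` mimics a call that always reports 0 bytes, and `maxCalls` exists only so that the demonstration terminates.

```typescript
// Stand-in for the underlying copy call: returns the number of bytes copied.
type CopyRange = (offset: number, remaining: number) => number;

interface CopyResult {
  copied: number;
  calls: number;
  done: boolean;
}

// Simulated retry loop: keeps calling until every byte has been copied.
// Without the maxCalls cap, a call that always returns 0 would spin forever.
function naiveCopy(copyRange: CopyRange, size: number, maxCalls: number): CopyResult {
  let copied = 0;
  let calls = 0;
  while (copied < size && calls < maxCalls) {
    copied += copyRange(copied, size - copied);
    calls++;
  }
  return { copied, calls, done: copied >= size };
}

// Variant that treats "0 bytes copied while data remains" as an error.
function guardedCopy(copyRange: CopyRange, size: number): number {
  let copied = 0;
  while (copied < size) {
    const n = copyRange(copied, size - copied);
    if (n === 0) {
      throw new Error(`copy made no progress at offset ${copied}`);
    }
    copied += n;
  }
  return copied;
}

// A healthy call copies up to 4096 bytes at a time; the buggy one copies none.
const healthy: CopyRange = (_offset, remaining) => Math.min(remaining, 4096);
const buggy: CopyRange = () => 0;

console.log(naiveCopy(healthy, 10000, 100));
console.log(naiveCopy(buggy, 10000, 100));
```

With `buggy`, `naiveCopy` only stops because of the artificial cap, which mirrors the hang; `guardedCopy` shows one way a loop could fail fast instead.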

Full props to @nburtsev for figuring this out. I'm not sure I myself would have been able to put all of this together.


In summary:

The CDK does not directly communicate with the kernel--we just perform filesystem copies. Bugs in the interaction of other pieces of software cause the file copy to loop endlessly if the right combination of circumstances is hit.

rix0rrr added a commit that referenced this issue Nov 24, 2022
A particular combination of software has a hard-to-recover bug.

Add a check and warning for it.

Closes #21379.
@mergify mergify bot closed this as completed in #23076 Dec 1, 2022
mergify bot pushed a commit that referenced this issue Dec 1, 2022
A particular combination of software has a hard-to-diagnose bug.

Add a check and warning for it.

Closes #21379.

----
*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
