Buildpack Dependency Management Improvements #8

Closed
ryanmoran opened this issue Apr 20, 2022 · 14 comments
@ryanmoran
Member

ryanmoran commented Apr 20, 2022

Summary

Many of the Paketo buildpacks contain references to dependencies that they will install during their build phase. These dependencies are often language runtimes like Ruby MRI or package managers like Poetry. The dependencies are tracked and built from their upstream source (dep-server) and updated in buildpacks (jam update-dependencies and dependency/update action) through a considerable amount of automation. This current architecture has outlived its utility and will likely present a significant technical headwind as we attempt to move buildpacks to new stacks.

Outcome

This exploration should focus on providing direction for a future effort to modernize the dependency-building infrastructure we depend upon in Paketo Buildpacks. In the process, it should weigh the following goals, along with any others discovered during the exploration, with the result being an RFC outlining a future direction for dependencies.

Goals

Remove Cloud Foundry Dependency

The dependency-building automation is tightly coupled to the legacy dependency-building infrastructure inherited from Cloud Foundry (buildpacks-ci and binary-builder). Making changes to these codebases to support new Paketo use cases and features is a convoluted and difficult process. Ideally, we could move, refactor, or rewrite this code into codebases that we maintain within the Paketo Buildpacks project.

Use Upstream References

Many of our dependencies may already be built in a form that is usable on top of our stacks. In these cases, we shouldn't be re-building them for no real benefit. Instead, we should just reference the upstream artifact download location. An example of this might be Go. The Go downloads page serves pre-built tarballs for Linux on a number of architectures. Ideally, we would just be able to reference these download URLs in our go-dist buildpack.toml file.
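For illustration, such an entry might look roughly like the following. This is a hypothetical sketch, not actual go-dist metadata; the field names follow the historical Paketo dependency-entry shape, and the checksum is a placeholder:

```toml
# Hypothetical dependency entry pointing straight at the upstream Go tarball,
# rather than a rebuilt artifact. URI, version, and checksum are illustrative.
[[metadata.dependencies]]
  id = "go"
  version = "1.18.1"
  uri = "https://go.dev/dl/go1.18.1.linux-amd64.tar.gz"
  sha256 = "<checksum published on the Go downloads page>"
  stacks = ["io.buildpacks.stacks.bionic"]
```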

Adopt Federated Model

The current dependency-building automation is mostly centralized in the dep-server repository. While this was good when we had a dedicated team of folks with a strong working knowledge of these components, it has become more difficult to maintain with a more diffuse ownership model. It would likely be more advantageous for us to move much of the monitoring and building infrastructure into the repositories where these dependencies will ultimately reside. For instance, it would make sense to have the dependency-building infrastructure for the Node.js runtime within the node-engine repository and directly under the responsibility of the Node.js Maintainers team. It still would make sense for a Dependencies team to help maintain expertise and tooling for the workflows involved in dependency-building generally, but the particulars of each dependency could be distributed to their respective buildpack repository.

Consolidate with Java Workflows

The dependency-building infrastructure described above does not encompass any of the dependencies that contribute to the Java buildpacks. The Java buildpacks have their own system for managing dependencies. It is worth considering what a consolidation of these systems might look like.

Enable Multi-stack / Multi-architecture Support

The dependency-building infrastructure is tightly coupled to an Ubuntu Bionic-derived stack on a Linux AMD64 architecture. Ideally, we would propose a solution that would enable us to deliver dependencies on a more diverse set of operating system / architecture pairings.

@ryanmoran ryanmoran moved this to 🕵️‍♀️ Exploring in Paketo Workstreams Apr 20, 2022
@dmikusa

dmikusa commented Apr 20, 2022

The dependency-building infrastructure described above does not encompass any of the dependencies that contribute to the Java buildpacks. The Java buildpacks have their own system for managing dependencies. It is worth considering what a consolidation of these systems might look like.

+1 - I was just talking to @ForestEckhardt about how we can do some consolidations in pipelines. I'd definitely be interested in arriving on a singular way to handle dependencies across all the Paketo buildpacks.

I'd also be happy to share our experiences using a more federated approach and pulling dependencies directly from upstream locations. There are some good parts and some challenges as well.

@fg-j

fg-j commented Apr 20, 2022

@dmikusa-pivotal I'd like to hear about your experience with the federated approach, for sure.

@ForestEckhardt
Contributor

I have a couple of questions/comments:

  1. Would moving to the workflow using already-hosted dependencies mean that we would stop trying to gather other information about these dependencies, such as the purl, cpe, and license, or would we still want to generate that information for the buildpack.toml?
  2. There are certain dependencies that we repackage for size reasons. I am thinking mostly of the .NET dependency, which we do some minor pruning on and also repackage using xz compression. By switching to the Microsoft-hosted dependencies we would no longer be getting that compression, and our dependencies would grow in size.
  3. There are some dependencies that we currently restructure to ensure that they can be decompressed directly into a layer without any further manipulation. This is often accomplished by stripping a top-level directory off of the packaged artifact. This is something that packit is capable of; I just wanted to note it as a thing.
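The restructuring described in point 3 can be sketched as follows. This is a minimal, hypothetical illustration (not packit's actual implementation) of repackaging a `.tar.gz` so that its contents decompress directly into a layer:

```python
import io
import tarfile

def strip_top_level(src_bytes: bytes) -> bytes:
    """Repackage a .tar.gz so entries lose their leading path component,
    e.g. 'pkg-1.0/bin/run' becomes 'bin/run'."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r:gz") as src, \
         tarfile.open(fileobj=out, mode="w:gz") as dst:
        for member in src.getmembers():
            parts = member.name.split("/", 1)
            if len(parts) < 2 or not parts[1]:
                continue  # drop the bare top-level directory entry itself
            member.name = parts[1]
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    return out.getvalue()
```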

Overall I am excited by the prospect of being able to eliminate some of the magic and some of the back-breaking work, and I think that it would make it easier for us to get new languages from community members. I think that it will also ultimately make our buildpacks make more sense, because we are downloading and installing the same dependency that your average developer is using. I think that this will also make it easier for users that need to replace the dependency with one hosted on a mirror, because the artifact in question will not be any different from the publicly available one.

@ryanmoran ryanmoran moved this from 🕵️‍♀️ Exploring to ❓Not scoped in Paketo Workstreams Apr 21, 2022
@dmikusa

dmikusa commented Apr 21, 2022

@dmikusa-pivotal I'd like to hear about your experience with the federated approach, for sure.

Some notes off the top of my head:

  1. You are dependent on the 3rd party's CDN for performance. Java pulls a number of dependencies from Github. The Github CDN is great in the US. It's not great in other parts of the world. I've gotten reports that it takes users upwards of five minutes to download something that takes me 20s to download. As the buildpack team, we have little recourse for issues like this. (As a side note, this is why caching and alternative download repo support are high on the priority list for the Java team)

  2. Detecting dependency updates is a pain. We're managing something like 20 different actions to check and fetch dependencies. In good cases, there's an API we can use to check for new versions and fetch downloads. In the bad cases, we're basically screen-scraping data and links, which is fragile and breaks occasionally. Even downloading Github resources can be a pain, because projects structure their Github repos/releases/tags/assets in slightly different ways.

  3. You have to be kind to the 3rd party CDN you're targeting. Many of them are OSS and we don't want to generate a lot of traffic, so we have to keep polling for updates at less frequent intervals (like daily). We also have to think about what pointing lots of buildpack users at someone's CDN might do. Buildpacks may need to download resources multiple times, which can consume more bandwidth and skew download metrics.

  4. It doesn't really work if you need to get resources that are behind a login. Even if we're allowed to redistribute them, if the vendor requires you to log in first, that doesn't work, because essentially the end-user would need to log in to download the resource.

  5. Most 3rd parties are not publishing sha256 hashes for downloads. Many don't publish any hash, and some publish only a sha1 hash. This means that we have to download some unknown resource, calculate the sha256 hash ourselves, and use that. It's not technically hard, but it undermines some of the usefulness of the sha256 hash.

  6. Speaking of hashes, some 3rd parties will change their downloads for a published release and not bump the version number so all of a sudden the hash will just stop matching and the buildpack will break. It then requires us to investigate and see what happened, which isn't always easy to determine. Then if we trust the change, it requires us to update buildpack.toml and publish a new buildpack version.

  7. Like @ForestEckhardt mentioned, you get whatever the 3rd party publishes. If they have a weird folder structure or include a bunch of stuff you don't need, that all gets installed. This hasn't been a huge issue for us, but we do occasionally have to strip a top-level directory off an archive or move some binaries into a bin/ directory as part of the install process in the buildpack. Probably what's more challenging here is the upcoming ARM64 work. It's getting more common, but not everyone is publishing ARM64 binaries at the moment. If the project doesn't have them, then you're kind of stuck.

  8. I think 7 raises a question, though: should we be stripping things out, or should we be giving users stock downloads? We've largely taken the approach of providing stock downloads (what you get if you, as a user, go and download the resource), but there have been cases where users have asked us to prune things. I can understand both sides of the argument.
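Points 5 and 6 above amount to recording a checksum at update time and failing loudly when it later drifts. A minimal sketch of that check (helper name hypothetical, not an existing Paketo tool):

```python
import hashlib

def verify_sha256(artifact: bytes, recorded: str) -> None:
    """Fail loudly if a downloaded artifact no longer matches the checksum
    recorded in buildpack.toml, e.g. because upstream republished a release."""
    actual = hashlib.sha256(artifact).hexdigest()
    if actual != recorded:
        raise ValueError(
            f"checksum mismatch: expected {recorded}, got {actual}; "
            "the upstream artifact may have been republished"
        )
```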

Would moving to the workflow using already-hosted dependencies mean that we would stop trying to gather other information about these dependencies, such as the purl, cpe, and license, or would we still want to generate that information for the buildpack.toml?

On the Java buildpacks, we have this information in buildpack.toml and update it when we update releases. The base information tends not to change, but we have to keep the versions in these fields all in sync. How is this being sourced with deps-server?

@ForestEckhardt
Contributor

On the Java buildpacks, we have this information in buildpack.toml and update it when we update releases. The base information tends not to change, but we have to keep the versions in these fields all in sync. How is this being sourced with deps-server?

It is being generated when it is input into the system, so it is possible. I was just curious whether this also meant we would be stripping down the data we are providing for each dependency. For the most part, many of our buildpacks use this to construct the old SBOM format, and it may not make sense to have it there in the long run if we are using Syft.

@ryanmoran
Member Author

Would moving to the work flow using already hosted dependencies mean that we would stop trying to gather other information about these dependencies such as the purl, cpe, and license or would we still want to generate that information for our the buildpack.toml?

@ForestEckhardt, yes. We would still want this information. So, a solution would need to take this into account.

@ryanmoran
Member Author

ryanmoran commented Apr 21, 2022

For the most part many of our buildpacks use this to construct the old SBOM format and it may not make sense to have it there in the long run if we are using Syft.

It is also used to generate the new SBOM format: https://github.com/paketo-buildpacks/packit/blob/2247967a3f873b178f6fb16c5e6411646ca0882a/sbom/sbom.go#L72-L97

@ForestEckhardt
Contributor

I stand corrected

@sophiewigmore
Member

@dmikusa-pivotal thanks for outlining those cases, it's super helpful. I'd say that item 2, "Detecting dependency updates is a pain", is true regardless of what dependency management approach you take. The dep-server has similar issues anyway, so I'm not too worried about that.

Out of the items you mentioned, number 6 around hashes changing is the most concerning to me. Whatever process we implement, I think it'll be really important to have a way to reconcile mismatched SHAs or detect changes.

Numbers 7 and 8, around modifications to the dependency, are also pretty complicated, but I think moving to the federated approach will be a big help with this, since we can potentially delegate those types of decisions to language-family maintainers.

@dmikusa

dmikusa commented Apr 25, 2022

👍

One other thought that's been a hindrance for us. Github Actions are not well suited to checking for dependency updates. There is no trigger or event, even if the 3rd party is using Github to release code, so you end up having to poll for updates.

Presently, we're polling daily, because if we do it more often we'll blow past the limits Github Actions puts on the execution of our jobs. In some cases, this means we have to manually trigger the job like if we need to get an urgent update released. It's not a big deal and it's easy to do, but it's manual work.

Also, if you have a buildpack that has many dependencies then you run into an issue with how to organize them. The Liberty buildpack, for example, has quite a few dependencies that we monitor. We presently have them set up such that each dependency has its own workflow. The workflows are largely the same but just check for different resources. This has some advantages in that it's easy to have them all run in parallel, if one fails it doesn't impact others, and it's easy to trigger just a single resource if you need to force an update or re-run a failed update. It's not nice in that the parallelization makes us hit Github Action limits faster, there's lots of duplication across workflows, and it's extremely inefficient (Github Actions spins up a new VM for each workflow & job).

Personally, I'd like them to be more efficient. I've thought about how we could merge them all into a single workflow and job with multiple steps, but then you don't get the same parallelization and it's not easy to run/re-run a specific update. It's also not clear if that would help reduce duplication in the workflow, possibly, I haven't looked from that angle.
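One middle ground here, sketched with standard GitHub Actions features (the workflow, script, and dependency names below are hypothetical, not an existing Paketo or Java-buildpacks workflow), is a single workflow whose job fans out over a matrix of dependencies: the jobs still run in parallel, one failure doesn't stop the others, and a single dependency can be re-run, while the workflow definition itself is shared:

```yaml
# Hypothetical consolidated update-check workflow; dependency names illustrative.
name: Check for dependency updates
on:
  schedule:
    - cron: "0 5 * * *"   # daily, to stay within Actions usage limits
  workflow_dispatch:       # allows manually triggering an urgent update
jobs:
  check:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false     # one dependency failing doesn't cancel the others
      matrix:
        dependency: [open-liberty, bellsoft-liberica, syft]
    steps:
      - uses: actions/checkout@v3
      - name: Check ${{ matrix.dependency }} for updates
        run: ./scripts/check-update.sh "${{ matrix.dependency }}"  # hypothetical script
```

This doesn't address the per-job VM cost, since each matrix entry still gets its own runner, but it does collapse the duplicated workflow definitions into one.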

I've also thought about moving this type of fetch outside of Github Actions, somewhere it can be done more efficiently and then using hooks/API to trigger Github Actions or submitting a PR directly. That's a big step though and we haven't had time to investigate it further.

@thitch97 thitch97 moved this from ❓Not scoped to 📝 Todo in Paketo Workstreams Apr 25, 2022
@sophiewigmore sophiewigmore moved this from 📝 Todo to 🚧 In Progress in Paketo Workstreams May 9, 2022
@garethjevans

We don't have the luxury of using GitHub Actions internally, so we have built out a dependency update system in Concourse with a few custom Concourse Resources.

Re Point 5: there are a lot of dependencies that don't provide sha256sums but do provide others. If the sha256 exists, we download and use that; if it doesn't, we calculate a new one against the downloaded binary, but we also verify the downloaded binary against some of the other shas that are available. This gives us an extra bit of confidence that the binary is what we were expecting it to be.
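That cross-checking step can be sketched like so. This is a simplified illustration under stated assumptions (not the Concourse resource itself): verify whatever hashes upstream publishes, then derive a sha256 to record:

```python
import hashlib

def checksum_report(artifact: bytes, published: dict) -> str:
    """Verify the artifact against whatever hashes upstream publishes
    (e.g. {"sha1": "...", "md5": "..."}), then return a sha256 to record."""
    for algo, expected in published.items():
        actual = hashlib.new(algo, artifact).hexdigest()
        if actual != expected:
            raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")
    return hashlib.sha256(artifact).hexdigest()
```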

Re Semver: it would be great if everything supported semver-compatible version numbers, but not everything does. We've added quite a bit of logic around handling non-semver-compatible version numbers.
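For a flavor of what that logic looks like, here is a tiny normalizer. It is purely a sketch, not the actual Concourse logic, and real-world cases generally need per-dependency rules; it coerces two common non-semver forms (missing parts like `1.18`, and OpenSSL-style letter suffixes like `1.1.1n`) into three-part versions:

```python
import re

def coerce_semver(raw: str) -> str:
    """Best-effort coercion of common non-semver version strings.
    Handles '1.18' (missing patch) and OpenSSL-style '1.1.1n'."""
    # Split a trailing letter suffix (e.g. OpenSSL's 'n') off the numeric part.
    m = re.fullmatch(r"(\d+(?:\.\d+){0,2})([a-z])?", raw)
    if not m:
        raise ValueError(f"cannot coerce {raw!r} to semver")
    numbers, suffix = m.groups()
    parts = numbers.split(".")
    parts += ["0"] * (3 - len(parts))  # pad '1.18' -> '1.18.0'
    version = ".".join(parts)
    if suffix:                          # '1.1.1' + 'n' -> '1.1.1-n'
        version += f"-{suffix}"
    return version
```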

@robdimsdale
Member

robdimsdale commented May 16, 2022

One possible additional goal that we could consider is making the dep server itself a system that others can re-use - either in part or as a whole. Although this would potentially increase the support burden on the dependencies team, I think it could also result in a significant value add for other buildpack authors outside of the paketo core team.

I have three cases in mind where buildpack authors outside of the paketo team could benefit from a reusable (and federated) dep-server:

  1. Third-party OSS teams maintaining their own buildpacks with github infrastructure - they could stand up their own dep server (e.g. api.deps.some-oss-team.com) and use github actions to build their dependencies.
  2. Third-party commercial teams maintaining their own buildpacks with private github infrastructure. I don't know for sure how github actions works in a private model, but assuming it works similarly to the OSS actions, this would be identical to the above.
  3. Third-party commercial teams who cannot use github actions. They would potentially be able to stand up a dep server to gain its API, but would have to use an alternative system (e.g. concourse) to run the github scripts.

@fg-j fg-j moved this to 🚧 In Progress in Paketo Roadmap 2022 Jun 7, 2022
@dmikusa

dmikusa commented Jul 21, 2022

I wanted to also add this proposal as well: https://docs.google.com/document/d/1g5rRW-oE_v8Gdvz-CiCOK9z2rxg6L5XniKI25Zq2j6M/edit

I believe it to be complementary to what's been discussed already. You can take a look at the google doc for now, but I hope to get this into an RFC format in the near future.

@ryanmoran
Member Author

As the RFCs for dependency management have been approved and merged, and work is already underway to implement them, I will close this issue.

Repository owner moved this from 📨 PR Opened to ✅ Done in Paketo Workstreams Sep 23, 2022
Repository owner moved this from 🚧 In Progress to ✅ Done in Paketo Roadmap 2022 Sep 23, 2022