Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define complete set of integration points #251

Open
howardjohn opened this issue Dec 6, 2022 · 22 comments
Open

Define complete set of integration points #251

howardjohn opened this issue Dec 6, 2022 · 22 comments
Labels
size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Comments

@howardjohn
Copy link
Member

A concern for ztunnel is that it expands to support an unbounded number of integrations, leading to similar issues we have with Envoy. A goal of ztunnel is that it is light and minimal.

I would propose we limit integration points to the following:

  • Metrics: Prometheus and OTLP
  • Access Logs: stdout and OTLP
  • Traces: OTLP (if we implement tracing)
  • CA: Citadel gRPC and Kubernetes CSR (we can actually drop Citadel if Istiod supports CSR, but we might want it for VMs)

type: design

@howardjohn howardjohn added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 6, 2022
@bleggett
Copy link
Contributor

bleggett commented Feb 2, 2023

One thing I think it is important for ambient to continue to support is support for alternative SPIFFE-compliant workload CAs that implement SDS

We already support that in sidecar so parity is good/important, but in general I think having as few opinions as possible in ztunnel about how certs are minted and rotated and what the CA is attesting (or even what CA you happen to be using to do those things - our default or another) is good for constraining ztunnel's complexity.

We could build our own SDS-lite in ztunnel, and obviously we don't want to implement envoy specs for everything, but where it gets us broader compatibility with existing stuff and lets us clearly demarcate what ztunnel will not do and will not implement, I think it can be helpful.

@howardjohn
Copy link
Member Author

Any option would be a tradeoff between supported integrations and complexity. SDS might make the cut. The Kubernetes CSR is a core K8s API intended for exactly that purpose though, so I would be curious if existing SDS implementations could/would implement the K8s CSR API - and if not, their reasoning may be interesting

@howardjohn
Copy link
Member Author

Hmm I guess SDS probably gets us a full cert+priv key, while CSR simply signs them.

@bleggett
Copy link
Contributor

bleggett commented Feb 2, 2023

@howardjohn does ztunnel currently (or do we expect it to in the future) mint workload certs or sign anything related to workloads itself, or does it largely work like sidecars do and simply expect $some_workload_cert to exist and be validatable against some authority?

I would assume not, and that it should be able to remain pretty stupid about workload certs.

@howardjohn
Copy link
Member Author

We have a certificate per workload identity on the node, not just 1 cert

@bleggett
Copy link
Contributor

bleggett commented Feb 2, 2023

We have a certificate per workload identity on the node, not just 1 cert

Right - but ztunnel shouldn't ever need to mint any of those, it should just expect to obtain them, and validate them against some authority, correct.

@howardjohn
Copy link
Member Author

If there was some way to obtain them, yes. Currently there isn't afaik

@bleggett
Copy link
Contributor

bleggett commented Feb 3, 2023

Linking istio/istio#42339 as potentially related - as currently sidecar proxy SDS support is how we integrate SPIRE with Sidecar and presumably we'd need a semi-compatible model for ambient.

@howardjohn
Copy link
Member Author

I will note full compatibility with all Istio features is an anti-goal of ambient; part of the goals are to ditch all the baggage accumulated over the years. The goal of this issue is to find the minimum set of integration points we can have. So while I think SDS is plausible, sidecar support is not an important factor in decisions IMO. For example, we would not add lightstep support or something (instead using Otel)

@bleggett
Copy link
Contributor

bleggett commented Feb 3, 2023

I will note full compatibility with all Istio features is an anti-goal of ambient; part of the goals are to ditch all the baggage accumulated over the years. The goal of this issue is to find the minimum set of integration points we can have. So while I think SDS is plausible, sidecar support is not an important factor in decisions IMO. For example, we would not add lightstep support or something (instead using Otel)

Agreed, and that's certainly fair - but in general keeping ztunnel as cleanly separated from the workload CA as possible in a way that supports people bringing their own seems like a reasonable thing to consider, as a clarifying function for what ztunnel should not care about.

@kyessenov
Copy link

I think it's a nice aspiration to limit the integrations, but the reality is that the world evolves, and the standards come and go (not to mention it'll be ages before current ones are fully adopted). I'd suggest instead to focus on clear separations in #353, and support integrations with extensions.

@bleggett
Copy link
Contributor

bleggett commented Feb 3, 2023

I think it's a nice aspiration to limit the integrations, but the reality is that the world evolves, and the standards come and go (not to mention it'll be ages before current ones are fully adopted). I'd suggest instead to focus on clear separations in #353, and support integrations with extensions.

That's an avenue as well, agreed. At the end of the day, I think it's important to have a requirement that ztunnel should be OK with you bringing your own SPIFFE-compliant workload CA instead of forcing people to use the one Istio ships - at this point I don't have a strong opinion on how that should be implemented.

@bleggett
Copy link
Contributor

bleggett commented Feb 8, 2023

I think @kyessenov brings up a good point wrt "integration points" here in #353 - I think there might be two categories of things we are discussing here:

  1. Documented, end-user facing integrations - "bring your own X and we will play with it" - these are the kinds of things John is mentioning - the list John has here makes sense, and I agree these need to be limited and formalized, as we will have to document and support them here in this repo -> we (this repo) bear the cost.

  2. Internal interfaces/extension points that make it easier to hack on ztunnel proper, but which we don't make stability, compatibility, or functionality guarantees about and which we don't support as user-facing - if you want to use one of these you either need to fork the repo (and bear that cost yourself) or wrap ztunnel with your own code (and bear the cost of keeping in sync with upstream) -> you (entity hacking on ztunnel) bear the cost.

I think we will have to have both of these - I think John's list seems like a very reasonable upper bound for category no. 1 at this time, and I think things like e.g. #353 make sense for category no. 2 and what @kyessenov is pointing out - internal interfaces and abstraction points that make it easier for people to work on and add self-maintained extensions to the code (possibly for future upstreaming, possibly not), but which don't meet the bar of "this repo will support this as a global integration point for end-user BYOS"

As far as for constraining category no. 2, I think the only practical/effective way to do that in the long run is to keep ztunnel's sphere of functional responsibility and feature set limited.

@kyessenov
Copy link

My point is that limiting (1) is basically impossible, any software will have to majorly adapt to evolving standards in the cloud infrastructure. Even OpenTelemetry is rapidly changing, so it's not clear which version of OpenTelemetry is referenced here. (2) will help re-disk, because it'll limit the impact of change. SPIRE/SPIFFE will also change, and SDS is not really a standard (even Google doesn't always use it for mTLS).

The best way to preserve compatibility is via protocols, so data (HBONE) and config specifications (xDS) is what really would help software evolution.

@bleggett
Copy link
Contributor

bleggett commented Feb 8, 2023

My point is that limiting (1) is basically impossible, any software will have to majorly adapt to evolving standards in the cloud infrastructure. Even OpenTelemetry is rapidly changing, so it's not clear which version of OpenTelemetry is referenced here. (2) will help re-disk, because it'll limit the impact of change. SPIRE/SPIFFE will also change, and SDS is not really a standard (even Google doesn't always use it for mTLS).

Agree - limiting either (1) or (2) is not entirely possible in the cosmic sense, but being slow/parsimonious with what we agree to support with (1) is something we do need to do, and to your point having (2) de-risks/allows us to do that.

The best way to preserve compatibility is via protocols, so data (HBONE) and config specifications (xDS) is what really would help software evolution.

I agree with that up to the point where we invent our own protocol and specification instead of using/tracking someone else's open spec - OTel is an example of that, it's an open spec + wire standard (the only one, really) with broad support we should track and follow, and not care much about - there's very little value in supporting anything else at this point.

SDS is just a de-facto standard yeah, but the SDS pattern of keeping secrets and CAs (and what those CAs attest about the workloads they are minting certs for) formally separate from proxy concerns is one we will need to follow, regardless of the mechanism.

@kyessenov
Copy link

I agree with that up to the point where we invent our own protocol and specification instead of using/tracking someone else's open spec - OTel is an example of that, it's an open spec + wire standard (the only one, really) with broad support we should track and follow, and not care much about - there's very little value in supporting anything else at this point.

Yes, I didn't mean we invent custom protocols. Exactly the opposite - we use the standards like MASQUE, xDS data plane config, OTel protocols. Integrations happen on the other side of these protocols - if you want custom telemetry, use an OTel collector; want some fancy routing, modify xDS, etc. Custom secret distribution may be in the domain of xDS (SDS). We should be very clear that HBONE is not a new protocol, it's an adaptation of MASQUE.

The main question raised earlier was about wire integrations, e.g. Wasm modules. If we follow the above principle, the right architecture is to chain traffic using a standard protocol to a different process. This can be a waypoint host or a different runtime that speaks HBONE. Waypoint is not just for HTTP processing, it is really the integration mechanism for the traffic analysis that is potentially unsafe or costly.

@bleggett
Copy link
Contributor

bleggett commented Feb 9, 2023

Waypoint is not just for HTTP processing, it is really the integration mechanism for the traffic analysis that is potentially unsafe or costly.

I think there are some cases this does not cover - where a waypoint roundtrip is not acceptable, or we have no guarantee that a waypoint exists - but agree in general, especially for anything above L4/simple TCP packet routing.

@SpectralHiss
Copy link

SpectralHiss commented Apr 24, 2023

Continued support for Citadel gRPC (istio-csr API) in ztunnel would allow us to drop-in the istio-csr cert-manager project for ambient as well. is it too much of a pain to move the citadel istio-agent code to rust? we can try and provide assistance on this front.
If not mistaken it seems from this issue that only Kubernetes CSR is currently definitely marked for inclusion in future releases?

@howardjohn
Copy link
Member Author

I don't think there is agreement on dropping Citadel yet. This is also currently the only mode actually implemented, btw.

@dhruvmehtaaa
Copy link

Hello Everyone,
My name is Dhruv Mehta and I am currently a student at IIT Bombay, India. I would like to work on this issue in the upcoming LFX term.

@kfaseela
Copy link
Member

kfaseela commented Feb 6, 2024

@dhruvmehtaaa We are still finalizing on the LFX mentee selection process, and once done, will reach out to the selected person.

@kfox1111
Copy link

I would think, choosing a mechanism that outsources the key/cert/ca config to an outside process would be much more compatible with existing and future software, then a limited subset of functionality like passing a csr to another process. The key/cert/ca api would allow processes to be plugged in for kubernetes and spire, while just a csr excludes spire support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Status: No status
Status: No status
Development

No branches or pull requests

7 participants