Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(cri): restructure CRI API (improve robustness, clarity and maintainability) #1600

Merged
merged 7 commits into from
Mar 7, 2024

Conversation

incertum
Copy link
Contributor

@incertum incertum commented Dec 21, 2023

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area API-version

/area build

/area CI

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-gvisor

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap-engine-nodriver

/area libscap-engine-noop

/area libscap-engine-source-plugin

/area libscap-engine-savefile

/area libscap

/area libpman

/area libsinsp

/area tests

/area proposals

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

/version driver-API-version-patch

/version driver-SCHEMA-version-major

/version driver-SCHEMA-version-minor

/version driver-SCHEMA-version-patch

What this PR does / why we need it:

I have added very detailed comments to the code diff re the implemented changes.

At a high level the refactor achieves the following:

  • Add robustness wrt to metadata retrieved from the PodSandboxStatusResponse
  • Improve technical clarity and maintainability (flow and steps are now easy to follow and understand)
  • Remove some redundancies
  • Expand unit tests (further improving technical clarity and maintainability)
  • This PR does NOT add any extra API calls that weren't done prior to this refactor

Which issue(s) this PR fixes:

#1589

Why this PR matters:

  • Currently, we are not handling or enriching pod sandbox container processes correctly. This means adopters can falsely report that a container image is missing, when in fact it may be legitimate. Once this PR is merged, adopters will have the option to compare container and pod sandbox IDs. If they are equal, they will know that the container images will never exist.
  • Additionally, I would rate this code cleanup as highly crucial for slightly optimizing the CRI API lookups. More importantly, it will make them unit-testable and improve transparency and maintainability in the long run.

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@@ -209,7 +209,8 @@ std::string sinsp_container_manager::container_to_json(const sinsp_container_inf
inet_ntop(AF_INET, &iph, addrbuff, sizeof(addrbuff));
container["ip"] = addrbuff;

container["cni_json"] = container_info.m_pod_cniresult;
container["cni_json"] = container_info.m_pod_sandbox_cniresult;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better / more clear variable and function naming throughout.

@@ -327,7 +327,9 @@ class sinsp_container_info
int64_t m_cpu_period;
int32_t m_cpuset_cpu_count;
std::list<container_health_probe> m_health_probes;
std::string m_pod_cniresult;
std::string m_pod_sandbox_id;
std::map<std::string, std::string> m_pod_sandbox_labels;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having m_pod_sandbox_labels natively in the container will optimize k8s filterchecks and get rid of the confusion and overloaded container labels that even confused us maintainers.

* @param info status.info() Map
* @return Json::Value, can be null
*/
Json::Value get_info_jvalue(const google::protobuf::Map<std::string, std::string> &info);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common helper reduces code duplication and redundant lookups. We now consistently only do this lookup once for each API call, not anymore 3-5 times as previously. It also helps making the CRI API code base more transparent and clearer. In general, I made the variable and params naming more consistent and expressive throughout to improve maintainability going forward.

{
template<typename api> bool pod_uses_host_netns(const typename api::PodSandboxStatusResponse &resp)
{
const auto netns = resp.status().linux().namespaces().options().network();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper for that was unnecessary API surface. Removed the helper.

@@ -104,7 +95,7 @@ inline sinsp_container_type cri_interface<api>::get_cri_runtime_type() const
}

template<typename api>
inline grpc::Status cri_interface<api>::get_container_status(const std::string &container_id,
inline grpc::Status cri_interface<api>::get_container_status_resp(const std::string &container_id,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once more this type of renaming will make the API code base clearer and improve maintainability going forward.

if(container_info == nullptr || container_info->m_labels.empty())
// No m_pod_sandbox_id means no k8s.
// m_pod_sandbox_id retrieved from the ContainerStatusResponse CRI API call.
if(container_info == nullptr || container_info->m_pod_sandbox_id.empty())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leogr @Andreagit97 this should satisfy the desired behavior, please confirm.

return NULL;
}
RETURN_EXTRACT_STRING(container_info->m_pod_cniresult);
// Requires s_cri_extra_queries enabled, which is the default for Falco.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear docs.

"log_path": "busybox.0.log",
"linux": {
"resources": {
"metadata": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only adjusted indentation

@@ -741,4 +690,88 @@ TEST_F(sinsp_with_test_input, container_parser_cri_containerd)
ASSERT_EQ(get_field_as_string(evt, "k8s.pod.ip"), "10.244.0.2");
ASSERT_EQ(get_field_as_string(evt, "k8s.pod.cni.json"), "{\"bridge\":{\"IPConfigs\":null},\"eth0\":{\"IPConfigs\":[{\"Gateway\":\"10.244.0.1\",\"IP\":\"10.244.0.2\"}]}}");
}

TEST_F(sinsp_with_test_input, container_parser_cri_containerd_sandbox_container)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now extra tests for pod sandbox containers.

m_inspector.m_container_manager.add_container(std::move(container_info), init_thread_info);

auto sandbox_container_info = std::make_shared<sinsp_container_info>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now also clearly shows that we have all we need in the container, no need for extra pod sandbox container lookup from the container cache anymore.

@@ -808,6 +785,27 @@ inline bool cri_interface<api>::parse(const libsinsp::cgroup_limits::cgroup_limi
container.m_imagetag.c_str(), container.m_image.c_str(),
container.m_imagedigest.c_str());
}

/*
* The recent refactor makes full use of PodSandboxStatusResponse, removing the need to access pod sandbox containers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnosek highlighting again that I am not introducing extra CRI API calls. They were already there before.

In summary, the design is that for each container that is not a pod sandbox container you get a ContainerStatsResponse once and also a PodSandboxStatusResponse once and fully make use of the fact that you have the PodSandboxStatusResponse.

If we want to complicate things and cache PodSandboxStatusResponse info in pod sandbox containers so that the call is only done once for each pod is a different discussion. I would not recommend complicating things for now, but we could think about it in the future.

@incertum incertum force-pushed the refactor/cri-api-structure branch from 187f557 to 87ab2f3 Compare December 22, 2023 00:07
{
sandbox_container_id.resize(12);
// Fallback: Retrieve PodSandboxStatusResponse fields stored in pod sandbox container
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For maximum robustness in retrieving the fields we get from the PodSandboxStatusResponse fetch them either from the container or as fallback from the pod sandbox container.

@incertum incertum force-pushed the refactor/cri-api-structure branch from 06aa5b1 to 3e72119 Compare December 22, 2023 23:46
@incertum incertum force-pushed the refactor/cri-api-structure branch from 3e72119 to decf0e4 Compare January 2, 2024 23:53
@incertum incertum changed the title refactor(cri): restructure API structure refactor(cri): restructure CRI API (improve robustness, clarity and maintainability) Jan 2, 2024
@incertum incertum marked this pull request as ready for review January 2, 2024 23:54
@poiana poiana requested a review from LucaGuerra January 2, 2024 23:54
@incertum
Copy link
Contributor Author

incertum commented Jan 2, 2024

@gnosek may I ask for a first-pass review from you? Thanks a bunch in advance?

@incertum
Copy link
Contributor Author

incertum commented Jan 2, 2024

If ok with everyone

/milestone 0.15.0

@poiana poiana added this to the 0.15.0 milestone Jan 2, 2024
@incertum incertum force-pushed the refactor/cri-api-structure branch from decf0e4 to 0184fda Compare January 6, 2024 04:05
@incertum incertum force-pushed the refactor/cri-api-structure branch from 0184fda to f7ba20d Compare January 16, 2024 17:25
@incertum incertum force-pushed the refactor/cri-api-structure branch from f7ba20d to a17eaba Compare February 11, 2024 02:32
@incertum
Copy link
Contributor Author

@falcosecurity/libs-maintainers kindly checking in on the status of this PR. Thank you.

@incertum
Copy link
Contributor Author

Please note that this PR has been open for 2 months now without receiving any review whatsoever. During this time, it appears that our focus has been primarily directed towards prioritizing and integrating similarly complex code alterations.

Why this PR matters:

  • Currently, we are not handling or enriching pod sandbox container processes correctly. This means adopters can falsely report that a container image is missing, when in fact it may be legitimate. Once this PR is merged, adopters will have the option to compare container and pod sandbox IDs. If they are equal, they will know that the container images will never exist.
  • Additionally, I would rate this code cleanup as highly crucial for slightly optimizing the CRI API lookups. More importantly, it will make them unit-testable and improve transparency and maintainability in the long run.

@FedeDP
Copy link
Contributor

FedeDP commented Feb 22, 2024

PR LGTM, but i am in no way an expert on this part of the code :)
I really like the refactor though, that is the most important thing.
/cc @gnosek 🙏

@leogr
Copy link
Member

leogr commented Feb 22, 2024

On my side, I will prioritize this once #1595 gets merged (hopefully soon).
/assign

I still want to confirm that this is targetted for libs 0.15

@incertum incertum force-pushed the refactor/cri-api-structure branch from a17eaba to a121523 Compare February 28, 2024 22:41
@incertum
Copy link
Contributor Author

@leogr this PR should be the next one. Rebased and added another commit on top in an attempt to improve comments and naming and code organization even more.

If it help, happy to schedule a call to more interactively review this PR, just let me know, thanks a bunch!

@leogr
Copy link
Member

leogr commented Feb 29, 2024

@leogr this PR should be the next one. Rebased and added another commit on top in an attempt to improve comments and naming and code organization even more.

If it help, happy to schedule a call to more interactively review this PR, just let me know, thanks a bunch!

I'm working on it. @therealbobo already did some tests, and everything worked properly (@therealbobo do you confirm?)/

cc @falcosecurity/libs-maintainers for visibility

@incertum
Copy link
Contributor Author

incertum commented Feb 29, 2024

I'm working on it. @therealbobo already did some tests, and everything worked properly (@therealbobo do you confirm?)/

Thank you! While it's a terrible diff, I believe this PR is less risky, plus at least it's backed up my unit tests (acknowledging some limitations w/ mock CRI response unit tests, but still better than before when we had nothing).

Perhaps, double check and pay close attention to all conditional checks etc.

Plus plz let me know if the filterchecks fallback mechanisms are ok with you:

  • We now consistently first try to get all pod sandbox related fields from the container
  • Falling back to fetching the separate pod sandbox container from the cache (which is now also fully populated, it was not the case before). The cniresult string was chosen as check for it as it should always be there in Kubernetes, while labels could legitimately be empty who knows. Perhaps down the road there could be another improvement PR, but I wouldn't aim for absolutely perfect right now either.
  • The fallback could be important if for example the pod sandbox CRI call failed after a successful container CRI call. Probably not very common, but it could happen since we do not mark the overall lookups within parse as unsuccessful when the container lookups succeeded but the subsequent additional pod sandbox CRI call failed.

@therealbobo
Copy link
Contributor

@leogr I did! I also see new commits. Let me check! 👀

@incertum
Copy link
Contributor Author

@leogr I did! I also see new commits. Let me check! 👀

Only the last commit is new, else conflict-free rebase. The last commit is merely one more renaming and adding more comments pass etc, no changes really.

@incertum
Copy link
Contributor Author

incertum commented Mar 1, 2024

Since we are doing this, let's do it right: I pushed one more commit to really be super consistent also in the order of calls in the CRI code base (plus few more comments). Ok to call me crazy on those details, but as said hopefully we don't need to touch this again for a bit.

@leogr
Copy link
Member

leogr commented Mar 1, 2024

Since we are doing this, let's do it right: I pushed one more commit to really be super consistent also in the order of calls in the CRI code base (plus few more comments). Ok to call me crazy on those details, but as said hopefully we don't need to touch this again for a bit.

👍 Testing this again (in progress).

@poiana poiana added the lgtm label Mar 1, 2024
@poiana
Copy link
Contributor

poiana commented Mar 1, 2024

LGTM label has been added.

Git tree hash: 0087a2cc73d9fc24d2e741c5bb350dc12e4c7e33

Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented Mar 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, leogr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [FedeDP,incertum,leogr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana merged commit ad99020 into master Mar 7, 2024
41 checks passed
@poiana poiana deleted the refactor/cri-api-structure branch March 7, 2024 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants