
feature/add-kafka-monitoring #625

Merged
merged 15 commits on Feb 13, 2025

Conversation

klinch0
Contributor

@klinch0 klinch0 commented Feb 10, 2025

Summary by CodeRabbit

  • New Features

    • Launched a comprehensive Kafka monitoring dashboard with enhanced, real-time panels (including a "Current Raft State" view) for visualizing broker, controller, and partition metrics.
    • Improved metrics collection, alerting, and offset management across Kafka components.
    • Introduced advanced configuration options for flexible storage management and pod disruption handling.
    • Added support for dynamic queries and templating in the monitoring dashboard.
    • Enhanced Prometheus alerting rules for better monitoring of Kafka ecosystem components.
  • Chores

    • Updated application and operator versions, including refined image version mappings.
    • Added new entries for monitoring dashboards in configuration files.
  • Documentation

    • Clarified migration paths and Kubernetes version requirements.

@klinch0 klinch0 requested a review from kvaps as a code owner February 10, 2025 10:30
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 10, 2025
Contributor

coderabbitai bot commented Feb 10, 2025

Walkthrough

The pull request implements extensive updates across Kafka monitoring and deployment configurations. It introduces a new Grafana dashboard for Strimzi Kafka, updates Helm charts and version tracking, and enhances various Kubernetes CRDs with new fields and features (such as CSI volume support and offset management). Additionally, RBAC rules, deployment templates, image mappings, and Prometheus alert rules have been improved. These changes aim to improve observability, configuration flexibility, and resource management within the Kafka operator ecosystem.

Changes

File(s) and change summary:

  • dashboards/kafka/strimzi-kafka.json, hack/download-dashboards.sh, packages/extra/monitoring/dashboards.list, packages/system/kafka-operator/charts/strimzi-kafka-operator/files/grafana-dashboards/strimzi-kraft.json: Added a new Grafana dashboard for Strimzi Kafka with panels for metrics (including a "Current Raft State" panel) and updated the download script and dashboards list to include the new dashboard.
  • packages/apps/kafka/Chart.yaml, packages/apps/versions_map: Updated Kafka application version from 0.3.1 to 0.3.2 and revised version mapping with a specific commit hash.
  • packages/apps/kafka/templates/kafka.yaml, packages/apps/kafka/templates/metrics-configmap.yaml, packages/apps/kafka/templates/podscrape.yaml: Introduced new metrics configurations: added metricsConfig for Kafka and Zookeeper, created a ConfigMap for JMX Prometheus Exporter configurations, and added a new VMPodScrape resource for scraping pod metrics.
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/Chart.yaml, README.md, values.yaml, templates/_kafka_image_map.tpl: Upgraded the Strimzi Kafka Operator chart (appVersion/version from 0.43.0 to 0.45.0), updated default image tags, Kubernetes prerequisites, added generatePodDisruptionBudget option, and refreshed Kafka image mappings to newer versions.
  • Multiple CRDs in packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/ (files: 040, 041, 045, 046, 047, 048, 049): Enhanced CRDs by adding new properties such as autoRebalance, host templates, CSI volume source support, advanced environment variable sourcing (valueFrom), offset management (listOffsets/alterOffsets), revised status conditions, and new operation modes like remove-disks.
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml, 023-ClusterRole-strimzi-cluster-operator-role.yaml, 023-RoleBinding-strimzi-cluster-operator.yaml, 031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml, 060-Deployment-strimzi-cluster-operator.yaml: Improved RBAC and deployment templates by introducing a dynamic $watchNamespaces variable, adjusting conditional logic for selecting RoleBinding/ClusterRoleBinding, adding deletion permissions for kafkarebalances, and adding a new environment variable for pod disruption budget generation.
  • packages/system/kafka-operator/templates/prometheus-rules.yaml: Added comprehensive Prometheus alerting rules for monitoring Kafka brokers, Zookeeper, Kafka Connect, and Kafka Bridge components.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant G as Grafana Dashboard
    participant P as Prometheus
    participant K as Kafka Pod
    participant VS as VMPodScrape
    U->>G: Open Kafka dashboard
    G->>P: Query metrics for brokers, controllers, partitions
    P->>VS: Trigger pod metric scraping
    VS->>K: Fetch JMX metrics via ConfigMap settings
    K-->>P: Return Prometheus metrics
    P-->>G: Deliver queried metrics
    G-->>U: Display updated dashboard

Suggested labels

lgtm

Suggested reviewers

  • kvaps

Poem

In the code garden, I hop with glee,
New dashboards bloom for all to see.
Metrics and CRDs line my merry trail,
As updates and versions set sail.
With each commit, I thump my paw,
Celebrating change with a joyful "aha!" 🐇✨
Cheers from this little rabbit coder!


@dosubot dosubot bot added the enhancement New feature or request label Feb 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (27)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/060-Deployment-strimzi-cluster-operator.yaml (1)

143-146: Review Conditional Logic for New Pod Disruption Budget Environment Variable

A new conditional block has been introduced to set the environment variable STRIMZI_POD_DISRUPTION_BUDGET_GENERATION, which is only included when .Values.generatePodDisruptionBudget is not equal to true. Please verify that this is the intended behavior. If the aim is to override or provide a specific configuration when pod disruption budget generation is disabled, this condition is correct. However, to improve clarity and safeguard against potential type or evaluation issues, consider one of the following refinements:

  • Explicit Boolean Check: If .Values.generatePodDisruptionBudget is a boolean, you might use an explicit inverse condition such as if not .Values.generatePodDisruptionBudget to clearly indicate that the variable should be set when budget generation is disabled (see the sketch after this list).
  • Comment Clarification: Adding a YAML comment above this block to explain the use-case can aid future maintainers in quickly understanding the condition.
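
For illustration, a minimal sketch of the explicit boolean form suggested above, assuming the block lives under the operator container's env: list and that "false" is the value the operator expects when generation is disabled (both are assumptions, not taken from the chart):

        env:
          # Disable PDB generation in the operator only when the chart flag is off
          # (assumed semantics of the flag).
          {{- if not .Values.generatePodDisruptionBudget }}
          - name: STRIMZI_POD_DISRUPTION_BUDGET_GENERATION
            value: "false"
          {{- end }}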
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1)

5-5: Fix trailing whitespace.

Remove the trailing space at the end of line 5.

-  {{- $watchNamespaces = list -}}  
+  {{- $watchNamespaces = list -}}
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 5-5: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1)

5-5: Trailing Whitespace Cleanup

A YAMLlint warning indicates trailing spaces on this line. Please remove any extraneous spaces to maintain clean formatting.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 5-5: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml (1)

1-38: Refined Namespace Selection in Entity Operator RoleBinding

Similar to the other RoleBinding templates, the introduction of $watchNamespaces with the conditional reset (when watchAnyNamespace is true) enhances the RBAC configuration’s consistency across different templates. Please remove the trailing spaces reported on line 5 to ensure proper YAML formatting.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml (1)

82-89: New Deletion Permission for KafkaRebalances

A new rule granting delete permissions on the kafkarebalances resource has been added, which supports the operator’s auto-rebalancing workflow. However, YAMLlint warns that the indentation for the list items on lines 86 and 88 is incorrect (expected 2 spaces but found 4). Please adjust the indentation as shown below to maintain consistency:

@@
-    - kafkarebalances
+  - kafkarebalances
@@
-    - delete
+  - delete
🧰 Tools
🪛 YAMLlint (1.35.1)

[warning] 86-86: wrong indentation: expected 2 but found 4

(indentation)


[warning] 88-88: wrong indentation: expected 2 but found 4

(indentation)

packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (2)

183-185: Enable Pod Disruption Budget Generation
The changes introduce explicit configuration for connectBuildTimeoutMs (which remains at 300000) and—more importantly—the new option generatePodDisruptionBudget: true. This provides enhanced control over pod availability during disruptions. Make sure that the corresponding deployment templates (which conditionally set environment variables based on this flag) are updated and tested.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 185-185: no new line character at the end of file

(new-line-at-end-of-file)


185-185: Ensure Newline at End of File
The static analysis tool reported a missing newline at the end of the file. Adding a newline will improve compatibility with YAML linters and editors.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 185-185: no new line character at the end of file

(new-line-at-end-of-file)

hack/download-dashboards.sh (1)

83-83: Add Strimzi Kafka Dashboard Reference
The new dashboard reference for the Strimzi Kafka dashboard (modules/340-monitoring-kubernetes/monitoring/grafana-dashboards//kafka/strimzi-kafka.json) has been added. Note the double slash (//) before kafka—please verify whether this is intentional or if it should be a single slash to avoid potential file path issues.

packages/apps/kafka/templates/metrics-configmap.yaml (1)

1-5: Initialize New Metrics ConfigMap
A new ConfigMap is created for Kafka and Zookeeper metrics configurations. The metadata section using the Helm template variable ({{ .Release.Name }}-metrics) looks appropriate; however, consider quoting the name value (e.g., "{{ .Release.Name }}-metrics") if YAML lint issues persist in certain environments.
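
If quoting is adopted, the metadata block would look roughly like this (a sketch only; the rendered output is identical either way):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: "{{ .Release.Name }}-metrics"   # quoted to silence YAML linters on the raw template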

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 4-4: syntax error: expected , but found ''

(syntax)

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/047-Crd-kafkaconnector.yaml (1)

95-118: Introduce Offset Management Configuration
The new listOffsets and alterOffsets properties extend the CRD schema to support offset management via ConfigMaps. The structure and required fields are clear. It may be helpful to add examples or further documentation in the CRD’s description to guide users on how to configure these settings.
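
As a hedged illustration of what such an example might look like (the resource and ConfigMap names are hypothetical, and placement directly under spec is assumed; only the toConfigMap/fromConfigMap shape comes from the schema described here):

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaConnector
    metadata:
      name: my-source-connector                    # hypothetical connector name
    spec:
      listOffsets:
        toConfigMap:
          name: my-source-connector-offsets        # ConfigMap the operator writes current offsets into
      alterOffsets:
        fromConfigMap:
          name: my-source-connector-new-offsets    # ConfigMap the operator reads desired offsets from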

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (1)

105-120: Define moveReplicasOffVolumes Structure
The new property moveReplicasOffVolumes outlines an array of objects with brokerId and volumeIds. The minimum item restrictions help ensure valid configuration. Consider including an example snippet in the CRD documentation to clarify its usage.
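
A possible snippet for such documentation, assuming the property is used together with the new remove-disks mode (broker and volume IDs below are placeholders):

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaRebalance
    metadata:
      name: move-replicas-off-old-disks    # hypothetical name
    spec:
      mode: remove-disks
      moveReplicasOffVolumes:
        - brokerId: 0
          volumeIds: [1, 2]                # move replicas off volumes 1 and 2 on broker 0
        - brokerId: 1
          volumeIds: [2]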

packages/system/kafka-operator/templates/prometheus-rules.yaml (1)

134-152: Entity Operator Alerts – Pod Health Monitoring
The alerts for the entity operator (TopicOperatorContainerDown and UserOperatorContainerDown) correctly capture critical issues related to the operator pods. The expressions are clear; consider if a longer “for” duration might help avoid transient issues.

packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (2)

9-16: Migration and Deprecation Notice Update
The revised section clearly states that Strimzi 0.45 is the final version to support ZooKeeper-based clusters and MirrorMaker 1 deployments, and it urges migration to KRaft and MirrorMaker 2. The deprecation of the EnvVar Configuration Provider and MirrorMaker 2 Extensions is concisely noted. This update is clear and aligns with the migration objectives.


68-68: Markdown Unordered List Style
Static analysis suggests using asterisks (*) for unordered lists rather than dashes. Consider updating the prerequisites list (line 68) for consistency with the project’s markdown style guidelines.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

68-68: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

packages/system/kafka-operator/charts/strimzi-kafka-operator/files/grafana-dashboards/strimzi-kraft.json (2)

1076-1158: New “Current Raft State” Panel Addition
A new panel titled “Current Raft State” has been added. It queries max(kafka_server_raftmetrics_current_state) by (kubernetes_pod_name, current_state) and is configured as a table with transformations to rename fields. This addition will improve KRaft observability. Please verify that the query returns data in the expected format and that the transformation options align with your monitoring needs.


1224-1440: Revised Grid Positions for Leadership Panels
The grid positions for the “Current Quorum Leader”, “Current Voted Leader”, and “Current Quorum Epoch” panels have been updated (to widths of 6 instead of 8, with adjusted x-coordinates) to improve dashboard layout and visual coherence. Ensure that this new arrangement fits well with other panels and that there is no overlap or excessive white space.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (2)

1192-1224: Addition of “valueFrom” in bridgeContainer Environment Variables
The environment variable specification for the bridge container now supports a valueFrom field. This allows sourcing environment variable values from a Secret or a ConfigMap. The use of a oneOf validation (ensuring either a literal value or a valueFrom is provided) is a good design choice.


1371-1413: Addition of “valueFrom” in initContainer Environment Variables
Similarly, the init container’s env section now includes support for a valueFrom property. This change enhances flexibility in provisioning dynamic configurations during container startup. Verify that the oneOf construct correctly enforces the required fields and that both secretKeyRef and configMapKeyRef configurations work as expected.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (1)

1335-1368: Review of Added valueFrom Property for mirrorMakerContainer Environment Variables

A new valueFrom property has been introduced under the container’s environment variable configuration. It supports sourcing values from either a secret (secretKeyRef) or a config map (configMapKeyRef), wrapped in a oneOf clause to enforce mutual exclusivity. This enhances the flexibility in configuring environment variables. Please verify that this pattern is consistent with similar configurations elsewhere in your codebase, and consider adding notes to help users understand when to use each reference type.
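
A short sketch of the two reference types, assuming they are used inside the mirrorMakerContainer template (the variable, Secret, and ConfigMap names are illustrative only):

    template:
      mirrorMakerContainer:
        env:
          - name: SOURCE_PASSWORD               # either a literal value or valueFrom, not both
            valueFrom:
              secretKeyRef:
                name: mm-credentials            # hypothetical Secret
                key: password
          - name: EXTRA_OPTS
            valueFrom:
              configMapKeyRef:
                name: mm-tuning                 # hypothetical ConfigMap
                key: extra-opts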

dashboards/kafka/strimzi-kafka.json (1)

1-2941: Overall New Grafana Dashboard for Strimzi Kafka

This new JSON file introduces a comprehensive Grafana dashboard for monitoring a Strimzi Kafka cluster. The structure includes well-defined sections for annotations, panels, templating, and refresh settings. Key highlights:

Panels and Queries: Multiple panels display metrics (e.g., broker status, controller activity, replica counts, network and disk metrics) via Prometheus queries that use dynamic variables (such as $kubernetes_namespace, $strimzi_cluster_name, $kafka_broker, etc.). Ensure that these Prometheus expressions return the expected time series data in your environment.

Templating: The templating variables (datasource, namespace, cluster name, broker, topic, and partition) are correctly defined. Verify that the regular expressions used to extract label values match your actual metrics.

Refresh and Layout: With a refresh rate of 5 seconds and careful grid positioning, the dashboard appears designed to provide near-real-time insights without overwhelming the backend.

Usability: Since this dashboard is newly added, consider including user documentation (or inline panel descriptions) to guide users on interpreting each metric.

Overall, the JSON is well-organized and seems consistent with Grafana dashboard best practices. If the queries have been validated in your environment, no further changes are required.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/041-Crd-kafkaconnect.yaml (3)

1106-1132: CSI Volume Type Addition in Volumes
The new csi block is added as an option for volume configuration. This change increases flexibility by allowing CSI‐specified storage options. Please double‑check that its properties (driver, fsType, nodePublishSecretRef, readOnly, and volumeAttributes) match the expected schema and align with similar changes in related CRDs.


1214-1247: Enhanced Environment Variable Reference in Connect Container
Introducing the valueFrom property under the connect container’s environment variable configuration enables externalizing secrets or configuration values via Kubernetes Secrets or ConfigMaps. Ensure that documentation clearly explains both the usage and limitations for users and that the one‑of schema correctly enforces a single source reference.


1357-1390: Enhanced Environment Variable Reference in Init Container
Similar to the connect container, the init container now supports specifying environment variables via valueFrom (using either a secret or ConfigMap). Verify that this change is consistent with other CRDs and that the resulting behavior is well‑documented for operators.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (4)

2332-2358: CSI Volume Support in Pod Template for MirrorMaker2
A new csi option has been added to the volumes section within the pod template. This addition brings MirrorMaker2’s volume configuration inline with other components. Please verify that:

  • The set of properties (driver, fsType, nodePublishSecretRef, readOnly, volumeAttributes) conforms to the CSI VolumeSource spec,
  • The one‑of clause (ensuring that only one volume source is used) remains valid.

1431-1463: Dynamic Environment Variable Sourcing in Connect Container
The connect container now supports specifying environment variable values using a valueFrom block that allows either a secret or a ConfigMap reference. Make sure that the one‑of condition correctly enforces that only one of these is provided and that users are aware of this new option in the operator’s documentation.


1574-1607: Dynamic Environment Variable Sourcing in Init Container
The init container has also been updated to include the new valueFrom approach in its environment variable definitions. This change provides consistency throughout the pods and improves configuration flexibility. Please ensure compatibility with any consuming scripts or automated tests.


2378-2410: Dynamic Environment Variable Sourcing in Build Container
The build container now allows environment variables to be set via valueFrom, similar to the other container specifications. This enables users to source configuration or secrets dynamically during the build process. Verify that the schema’s one‑of clause functions as intended and that the change is documented in the build configuration guidelines.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa4dd3 and d234d58.

📒 Files selected for processing (26)
  • dashboards/kafka/strimzi-kafka.json (1 hunks)
  • hack/download-dashboards.sh (1 hunks)
  • packages/apps/kafka/Chart.yaml (1 hunks)
  • packages/apps/kafka/templates/kafka.yaml (2 hunks)
  • packages/apps/kafka/templates/metrics-configmap.yaml (1 hunks)
  • packages/apps/kafka/templates/podscrape.yaml (1 hunks)
  • packages/apps/versions_map (1 hunks)
  • packages/extra/monitoring/dashboards.list (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/Chart.yaml (2 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (5 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml (19 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/041-Crd-kafkaconnect.yaml (8 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (2 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (3 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/047-Crd-kafkaconnector.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (10 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (3 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/files/grafana-dashboards/strimzi-kraft.json (4 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/060-Deployment-strimzi-cluster-operator.yaml (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/_kafka_image_map.tpl (1 hunks)
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (3 hunks)
  • packages/system/kafka-operator/templates/prometheus-rules.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • packages/apps/kafka/Chart.yaml
  • packages/system/kafka-operator/charts/strimzi-kafka-operator/Chart.yaml
🧰 Additional context used
🪛 YAMLlint (1.35.1)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml

[error] 5-5: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml

[warning] 86-86: wrong indentation: expected 2 but found 4

(indentation)


[warning] 88-88: wrong indentation: expected 2 but found 4

(indentation)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: trailing spaces

(trailing-spaces)

packages/apps/kafka/templates/podscrape.yaml

[warning] 10-10: wrong indentation: expected 8 but found 6

(indentation)


[error] 4-4: syntax error: expected , but found ''

(syntax)

packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml

[error] 185-185: no new line character at the end of file

(new-line-at-end-of-file)

packages/apps/kafka/templates/metrics-configmap.yaml

[error] 4-4: syntax error: expected , but found ''

(syntax)

🪛 LanguageTool
packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md

[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...fka® cluster via the HTTP protocol. * Includes Cruise Control - Automates the proces...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md

68-68: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

🔇 Additional comments (44)
packages/extra/monitoring/dashboards.list (1)

37-37: LGTM! The Strimzi Kafka dashboard has been added.

The addition aligns with the PR's objective of implementing Kafka monitoring and maintains the list's alphabetical order.

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/_kafka_image_map.tpl (2)

8-31: LGTM! Version updates are consistent across all components.

The changes maintain version consistency by:

  • Updating default images to 3.9.0
  • Adding intermediate version 3.8.1
  • Adding latest version 3.9.0

3-4: Verify the file generation process.

The file header indicates this is a generated file. Please ensure the helm-charts/Makefile has been updated accordingly.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Description: Check if Makefile exists and contains relevant generation rules

# Look for the Makefile
fd --type f "Makefile$" helm-charts/ --exec cat {} | grep -A 5 "kafka.*image.*map"

Length of output: 186


I'll run an improved search across all Makefiles to locate any reference to the generated file. This should help confirm whether the file generation rules for _kafka_image_map.tpl in the Makefile have been updated to reflect the new version changes.

If this search returns no relevant output, it may indicate that the generation process isn’t directly referencing _kafka_image_map.tpl (or the Makefile might be located in a non-standard directory). In that case, please verify manually that the generation logic (as mentioned in the file header comment "generated in helm-charts/Makefile") has been updated accordingly with the version changes.

Based on the current automated findings, additional manual verification is needed to validate the Makefile update.


🏁 Script executed:

#!/bin/bash
# Searching all Makefiles in the repository for references to _kafka_image_map.tpl
fd --type f "Makefile$" | xargs grep -H "kafka_image_map.tpl"

Length of output: 61


Manual verification required for file generation rules.

The automated search did not locate any generation rule for _kafka_image_map.tpl in Makefiles (including the expected helm-charts/Makefile). Please manually verify that the Makefile handling file generation is updated to reflect the changes in version updates for the Kafka components.

packages/apps/versions_map (1)

26-27: LGTM! Version updates follow the established pattern.

The changes appropriately:

  • Lock version 0.3.1 to specific commit b7375f7
  • Add new version 0.3.2 tracking HEAD
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1)

3-23: LGTM! Well-structured conditional namespace handling.

The implementation correctly:

  • Initializes $watchNamespaces variable
  • Handles watchAnyNamespace condition
  • Switches between ClusterRoleBinding and RoleBinding
  • Manages namespace assignment appropriately
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 5-5: trailing spaces

(trailing-spaces)


[error] 20-20: trailing spaces

(trailing-spaces)

packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1)

1-7: Enhanced Namespace Handling in RBAC Configuration

The introduction of the $watchNamespaces variable along with the conditional check for $root.Values.watchAnyNamespace improves the flexibility of namespace management in the RoleBinding. This change helps ensure that when watchAnyNamespace is active, the list of watched namespaces defaults appropriately (to an empty list) while otherwise using the specified .Values.watchNamespaces. Note that YAML lint tools may incorrectly flag templating syntax (e.g. the leading {{- and trailing -}}) as errors—please verify that the rendered YAML is valid.
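
In template terms, the pattern described above amounts to roughly the following sketch (variable names are taken from the review; the real template also uses the resulting list to choose between RoleBinding and ClusterRoleBinding):

    {{- $watchNamespaces := .Values.watchNamespaces -}}
    {{- if .Values.watchAnyNamespace -}}
    {{- $watchNamespaces = list -}}
    {{- end -}}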

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: trailing spaces

(trailing-spaces)

packages/apps/kafka/templates/podscrape.yaml (1)

1-41: New VMPodScrape Resource for Enhanced Metrics Collection

The new VMPodScrape resource is well-defined and structured to enable Prometheus to scrape pod metrics via the specified endpoints. The use of templated metadata (e.g. {{ .Release.Name }}-pod-scrape) fits the Helm convention; however, please verify that these values render correctly during deployment. Also, YAMLlint reports a warning regarding the indentation of the "- separator: ;" line (line 10) and a syntax issue on line 4, which are likely false positives due to the mix of Helm templating and YAML. It is recommended to double-check the rendered output.
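
For readers unfamiliar with the resource, a rough sketch of a VMPodScrape follows; the apiVersion, selector labels, and port name are assumptions based on VictoriaMetrics operator conventions, not copied from this PR:

    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMPodScrape
    metadata:
      name: {{ .Release.Name }}-pod-scrape
    spec:
      selector:
        matchLabels:
          strimzi.io/cluster: {{ .Release.Name }}   # assumed pod label
      podMetricsEndpoints:
        - port: tcp-prometheus                      # assumed metrics port name
          path: /metrics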

🧰 Tools
🪛 YAMLlint (1.35.1)

[warning] 10-10: wrong indentation: expected 8 but found 6

(indentation)


[error] 4-4: syntax error: expected , but found ''

(syntax)

packages/apps/kafka/templates/kafka.yaml (2)

60-66: Addition of Kafka Metrics Configuration

The new metricsConfig field for Kafka, using jmxPrometheusExporter, is a strong enhancement for observability. Ensure that the referenced ConfigMap (named {{ .Release.Name }}-metrics) exists and contains the expected key (kafka-metrics-config.yml) to provide the proper configuration for metrics scraping.
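
For reference, the structure described above, as a sketch (the ConfigMap name and key are the ones mentioned here; the surrounding shape follows the usual jmxPrometheusExporter layout):

    kafka:
      metricsConfig:
        type: jmxPrometheusExporter
        valueFrom:
          configMapKeyRef:
            name: {{ .Release.Name }}-metrics
            key: kafka-metrics-config.yml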


77-82: Addition of Zookeeper Metrics Configuration

Similarly, the inclusion of a metricsConfig section for Zookeeper aligns with the extended monitoring capabilities intended by this PR. Please verify that the ConfigMap reference and key are correct and that the configuration structure matches what is required for the JMX Prometheus Exporter.

packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (2)

13-13: Update Default Image Tag Consistency
The default image tag for the operator has been updated to 0.45.0, aligning the operator version with the expected release. Please verify that downstream templates and documentation reflect this change.


129-129: Update Kafka Bridge Image Tag
The Kafka Bridge image tag is now set to 0.31.1. Ensure that this version is compatible with the rest of the Kafka operator components and that any related documentation is updated accordingly.

packages/apps/kafka/templates/metrics-configmap.yaml (2)

6-168: Comprehensive Kafka Metrics Rules Set
The rules under kafka-metrics-config.yml detail a wide range of patterns and metric types. The configurations appear thorough. It might be beneficial to add inline comments or examples for some of the more complex regex patterns to aid future maintainers.
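
As an example of the kind of inline commentary that would help, here is one commonly used Strimzi-style rule, annotated (shown for illustration only; the exact rules in this ConfigMap may differ):

    rules:
      # Maps e.g. kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=my-topic
      # to kafka_server_brokertopicmetrics_messagesin_total{topic="my-topic"}
      # (assuming lowercaseOutputName is enabled).
      - pattern: kafka.server<type=(.+), name=(.+)PerSec\w*, topic=(.+)><>Count
        name: kafka_server_$1_$2_total
        type: COUNTER
        labels:
          topic: "$3"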


169-190: Zookeeper Metrics Configuration
The zookeeper metrics rules are defined in a similar detailed manner. Just ensure that any changes or updates to JMX exporter rule formats are tracked in documentation.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (3)

34-37: Add Template Printer Column
The new additional printer column "Template" (using the JSONPath .metadata.annotations.strimzi\.io/rebalance-template) is added. Verify that this annotation is consistently applied to resources that serve as templates.


38-41: Add Status Printer Column
The addition of a "Status" printer column using the JSONPath .status.conditions[*].type offers a concise view of rebalancing status. Consider whether aggregating multiple condition types in one column meets your operational needs or if a custom formatter might provide clearer insight in the UI.


57-64: Expand Supported Rebalance Modes
The mode enum has been extended to include remove-disks, which permits intra-broker disk balancing. The updated description is detailed; please ensure supporting logic in the operator reconciler accommodates the nuances of disk rebalancing.

packages/system/kafka-operator/templates/prometheus-rules.yaml (8)

1-8: Resource Definition and Metadata Setup
The PrometheusRule resource is newly introduced and is configured with an appropriate API version, kind, and metadata (including labels such as role: alert-rules and app: strimzi). This is well structured and follows standard conventions.


10-92: Kafka Group Alerts – Comprehensive Coverage
The “kafka” group defines a broad set of alerts (e.g. for low disk space, under-replicated partitions, abnormal controller state, offline partitions, etc.). The PromQL expressions are detailed and incorporate dynamic labels in the annotations for context. Please double-check that thresholds (e.g. available disk space below 15%) and the durations (mostly 10s) align with operational expectations.


93-133: Zookeeper Group Alerts – Latency and Resource Conditions
Alerts in the “zookeeper” group (e.g. AvgRequestLatency, OutstandingRequests, running out of space, container restart frequency) are properly defined. Verify that the chosen numerical thresholds (like > 10 for latency or > 10 for outstanding requests) are well calibrated for your cluster’s performance characteristics.


152-178: Connect Group Alerts – Operational Failures Detection
The rules for the “connect” group (covering container downtimes and connector/task failures) are straightforward and useful for flagging issues. They should help in administering Kafka Connect reliability.


178-228: Bridge Group Alerts – HTTP and Latency Metrics
The “bridge” group includes several alerts to monitor container availability and HTTP error rates as well as latency metrics for both producers and consumers. The configuration appears well balanced; double-check that the alert durations and threshold values (e.g. error rate > 10) are appropriate for your environment’s noise level.


228-238: MirrorMaker Alert – Container Health Check
The single alert for MirrorMaker ensures that container downtime is caught if present for more than 3 minutes. This rule is direct and effective.


238-264: Kafka Exporter Alerts – Replication and Lag Monitoring
The “kafkaExporter” group covers alerts for under-replicated partitions, excessive consumer group lag, and a lack of message updates. These rules use concise PromQL expressions and dynamic annotations. Please confirm that the thresholds (e.g. consumer lag > 1000) have been validated in your production setting.


264-276: Certificates Alert – Expiry Notification
The “certificates” group alerts administrators when a certificate is nearing expiration. Using a fixed time window of 30 days seems reasonable; ensure that this duration matches your operational policy.

packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (7)

27-27: KRaft Support Emphasis
The highlighted “KRaft support” line explicitly informs users about running Kafka clusters in KRaft mode. This concise message is helpful given the migration push.


32-32: MirrorMaker Inclusion Clarity
The updated description for Kafka MirrorMaker inclusion is clear. It helps set the user’s expectation regarding mirroring functionality between Kafka clusters.


35-35: Auto-Rebalancing Feature Announcement
The note on auto-rebalancing when scaling is an attractive feature, ensuring that cluster health is maintained during scale operations. Confirm that the operational behavior meets your design goals.


36-36: Tiered Storage Feature Description
The “Tiered storage” update provides clear insight into offloading older data. Ensure that the documentation explains any configuration nuances associated with this feature.


68-68: Prerequisites Update – Kubernetes Version Requirement
Updating the prerequisite to “Kubernetes 1.25+” is an important operational change. Please verify that all deployment scripts and instructions reflect this updated requirement.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

68-68: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


105-105: Default Image Tag Update
The default image tag for all images (except Kafka Bridge) has been updated to 0.45.0. This is consistent with the operator upgrade and should be tested to ensure full compatibility with the new features.


169-169: Kafka Bridge Image Tag Update
The update of the Kafka Bridge image tag to 0.31.1 is clearly documented. Ensure that any downstream configurations or integrations referencing the previous tag are updated accordingly.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (1)

1095-1121: CSI Volume Support in Volumes Section
A new csi volume type has been added to the volumes section. The properties include driver, fsType, nodePublishSecretRef, readOnly, and volumeAttributes. This enhancement will allow more flexible storage configurations. Ensure that all required fields are correctly populated during deployment and that it’s compatible with your CSI drivers.
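
A hedged sketch of how the new block might be used in the bridge pod template (the driver and its volumeAttributes are just an example of a commonly deployed CSI driver, not something this chart ships):

    template:
      pod:
        volumes:
          - name: bridge-secrets                      # hypothetical volume name
            csi:
              driver: secrets-store.csi.k8s.io        # any installed CSI driver
              readOnly: true
              volumeAttributes:
                secretProviderClass: kafka-bridge-spc # driver-specific attribute, assumed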

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (2)

1271-1290: Review of CSI Volume Type Addition

The new csi volume definition is clearly added with its own object schema. The properties (driver, fsType, nodePublishSecretRef, readOnly, and volumeAttributes) are defined in a straightforward manner. Please verify that these fields match the Kubernetes CSI specification and your operator’s expectations. It may also be beneficial to consider explicitly listing required keys as per your validation needs.


1291-1297: Ensure Inclusion of CSI in Volume Options

The oneOf block now explicitly requires the presence of the csi property when that volume type is used. This approach clearly distinguishes it from other volume types (such as secret, configMap, etc.), and it aligns with your documentation note of replacing an empty required array with a firm requirement for the csi type.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (3)

340-363: Offset Management in Source Connector Configuration
The new properties listOffsets and alterOffsets enable dynamic offset management for the source connector. In this section, the schema now expects a toConfigMap (for listing offsets) and a fromConfigMap (for altering offsets), each with a required name field. Please verify that the naming convention for the referenced ConfigMaps is clearly communicated to users and that error messages are helpful if misconfigured.


396-419: Offset Management in Heartbeat Connector Configuration
Similar to the source connector, the heartbeat connector now includes listOffsets and alterOffsets settings. This enhancement supports more granular control of connector offset management. Ensure that both sections use consistent validation rules and that the change is reflected in any related documentation.


452-475: Offset Management in Checkpoint Connector Configuration
The addition of listOffsets and alterOffsets in the checkpoint connector mirrors the changes seen in the source and heartbeat connectors. Confirm that these additions consistently require a respective ConfigMap reference and that the expected behavior is thoroughly tested.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml (7)

483-495: LGTM: Host template configuration for broker hostnames

The new hostTemplate and advertisedHostTemplate properties provide a flexible way to customize broker hostnames using placeholders like {nodeId} and {nodePodName}. The implementation looks good with clear descriptions.
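
A sketch of how the placeholders might be used; the placement under a listener's configuration block is an assumption based on where similar host overrides live, and the domain is a placeholder:

    listeners:
      - name: external
        port: 9094
        type: nodeport
        tls: false
        configuration:
          advertisedHostTemplate: broker-{nodeId}.kafka.example.com   # expanded per broker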


1576-1594: LGTM: CSI volume source support added

The addition of CSI volume source support enhances storage flexibility by allowing the use of Container Storage Interface drivers. The implementation includes all necessary fields like driver, fsType, nodePublishSecretRef, and volumeAttributes.


1832-1874: LGTM: Environment variable value sourcing from secrets and configmaps

The new valueFrom property enables environment variables to be populated from secrets and configmaps, providing better security and configuration management. The implementation is consistent across all container templates.

Also applies to: 1975-2017, 2826-2868, 3278-3320, 3540-3582, 3683-3725, 4014-4056, 4157-4199, 4540-4582, 4683-4725, 5014-5056, 5157-5199, 5540-5582, 5683-5725, 6014-6056, 6157-6199, 6826-6868, 7165-7207


6411-6432: LGTM: Auto-rebalancing configuration for scaling operations

The new autoRebalance property provides configuration for automatic rebalancing when brokers are added or removed. The implementation includes:

  • Mode specification (add-brokers or remove-brokers)
  • Template reference for rebalance configuration

8333-8364: LGTM: Auto-rebalancing status tracking

The new autoRebalance status field provides comprehensive tracking of rebalancing operations:

  • Current state (Idle/RebalanceOnScaleDown/RebalanceOnScaleUp)
  • Last transition timestamp
  • Mode-specific broker lists for running/queued operations

1576-1594: Verify CSI volume source compatibility

Let's verify the CSI volume source implementation across different Kubernetes versions and check for any known issues.

Also applies to: 3130-3148, 3917-3935, 4498-4516, 5917-5935, 7123-7141, 8021-8039

✅ Verification successful

🌐 Web query:

What is the minimum Kubernetes version required for CSI volume source support in CRDs?

💡 Result:

The minimum Kubernetes version required for CSI volume source support in Custom Resource Definitions (CRDs) is v1.14. This is based on:

  1. The CSIDriver object (a CRD used for CSI configuration) became available as a built-in Kubernetes API in v1.14 as a beta feature[8]. This eliminated the need for manual CRD installation through YAML manifests.

  2. Key implementation details:

  • Beta status meant the feature was enabled by default starting from v1.14[8]
  • CSIDriver API integration allowed cluster administrators to configure driver behaviors through native Kubernetes objects rather than external CRDs[8]
  • This version alignment matches when CSI itself reached GA status in Kubernetes v1.13[1], with subsequent improvements to CRD handling

For production use cases requiring full stability, it's recommended to use Kubernetes v1.16+, as this brought additional maturity to both CSI and CRD versioning features[10]. However, basic functionality exists since v1.14[8].


CSI Volume Source Compatibility Verified
The web query confirmed that CSI volume source support within CRDs requires a minimum Kubernetes version of v1.14 (with production setups recommending v1.16+). Thus, the definition in the YAML snippet (and the similar blocks at lines 3130-3148, 3917-3935, 4498-4516, 5917-5935, 7123-7141, 8021-8039) is compatible with Kubernetes versions that provide CSI volume source support.


483-495: Verify auto-rebalancing configuration and status handling

Let's verify that the auto-rebalancing feature is properly documented and that there are no conflicting configurations.

Also applies to: 6411-6432, 8333-8364

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Search for auto-rebalancing related documentation and examples
rg -A 5 "autoRebalance" --type md

Length of output: 33


Below is a new shell script to search the entire repository for any occurrence of “autoRebalance” (not just in Markdown files) so that we can check if the auto‐rebalancing configuration and status handling are documented or defined elsewhere:

Please run this script and review the output. If nothing shows up (or you only find irrelevant matches), we will need to manually verify the intended documentation and configuration of the auto-rebalancing feature.


🏁 Script executed:

#!/bin/bash
# Search the entire repository (all file types) for references to "autoRebalance"
rg -R "autoRebalance"

Length of output: 426


Below is a revised shell script that correctly searches the entire repository for any reference to "autoRebalance". Please execute the script and share the output for further analysis:

After running this script, please provide the output so we can continue verifying the auto-rebalancing configuration and status handling.


🏁 Script executed:

#!/bin/bash
# Correctly search the entire repository for references to "autoRebalance"
rg "autoRebalance"

Length of output: 254


Below is a script that will extract additional context around the autoRebalance occurrences as well as the contents of the referenced line ranges (6411–6432 and 8333–8364) in the CRD file. This will help us verify that the auto-rebalancing feature is properly documented and that there are no conflicting configurations. Please run the script and share its output.


🏁 Script executed:

#!/bin/bash
echo "---- Retrieving autoRebalance context in CRD ----"
rg -C 5 "autoRebalance:" packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml

echo
echo "---- Extracting lines 6411-6432 ----"
sed -n '6411,6432p' packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml

echo
echo "---- Extracting lines 8333-8364 ----"
sed -n '8333,8364p' packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml

Length of output: 6102


Auto-Rebalancing Documentation Verified

The CRD clearly distinguishes between the auto-rebalancing configuration (lines 6411–6432) and status handling (lines 8333–8364). Both sections include well-defined types, enumerations, and descriptions without conflicts.
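
To make the configuration side concrete, a sketch follows; the placement under cruiseControl and the template name are assumptions, while the modes and template reference come from the schema described above:

    spec:
      cruiseControl:
        autoRebalance:                     # assumed placement
          - mode: add-brokers
            template:
              name: scaling-rebalance      # hypothetical KafkaRebalance marked as a rebalance template
          - mode: remove-brokers
            template:
              name: scaling-rebalance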

@klinch0 klinch0 force-pushed the feature/add-kafka-monitoring branch from f96a044 to d234d58 on February 10, 2025 12:29
@gecube
Collaborator

gecube commented Feb 10, 2025

Do we support KRaft Kafka instances? If yes, there should be different monitoring for them.

@gecube
Collaborator

gecube commented Feb 10, 2025

And also, please clean up the history: it looks like it should be a single commit to main.

Member

@lllamnyp lllamnyp left a comment

I don't know which parts of this code were simply taken from open sources and which were written in-house. I'm not keen on maintaining our own fork of the code either, given that our naming convention is pod while the open-source source uses kubernetes_pod_name, but if these are our own pieces of code, it would be better to bring them in line with our convention.

packages/apps/kafka/templates/podscrape.yaml (two review comments, outdated and resolved)
          severity: warning
        annotations:
          summary: 'Kafka under replicated partitions'
          description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'

Judging by the above, once my earlier comments are taken into account, the Prometheus rules will need to be fixed as well.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
packages/system/kafka-operator/templates/prometheus-rules.yaml (1)

212-227: Improve HTTP error rate alerts.

The current implementation has several limitations:

  • Uses absolute thresholds which don't scale with traffic
  • Only excludes favicon.ico from 4xx errors

Consider using error rates instead of absolute counts:

     - alert: Http4xxErrorRate
-      expr: strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge", path !="/favicon.ico"} > 10
+      expr: sum(rate(strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge"}[5m])) by (path) / sum(rate(strimzi_bridge_http_server_requestCount_total{container=~"^.+-bridge"}[5m])) by (path) > 0.05
       for: 1m
       labels:
         severity: warning
       annotations:
         summary: 'Kafka Bridge returns code 4xx too often'
-        description: 'Kafka Bridge returns code 4xx too much ({{ $value }}) for the path {{ $labels.path }}'
+        description: 'Kafka Bridge error rate is {{ $value | humanizePercentage }} for the path {{ $labels.path }}'

     - alert: Http5xxErrorRate
-      expr: strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"} > 10
+      expr: sum(rate(strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"}[5m])) by (path) / sum(rate(strimzi_bridge_http_server_requestCount_total{container=~"^.+-bridge"}[5m])) by (path) > 0.01
       for: 1m
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d234d58 and 84a2394.

📒 Files selected for processing (2)
  • packages/apps/kafka/templates/podscrape.yaml (1 hunks)
  • packages/system/kafka-operator/templates/prometheus-rules.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/apps/kafka/templates/podscrape.yaml
🔇 Additional comments (1)
packages/system/kafka-operator/templates/prometheus-rules.yaml (1)

264-276: LGTM! Well-designed certificate expiration monitoring.

The certificate expiration alert is well-implemented with appropriate thresholds and context.

Comment on lines +110 to +117
    - alert: ZookeeperRunningOutOfSpace
      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Zookeeper is running out of free disk space'
        description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'

🛠️ Refactor suggestion

Align Zookeeper disk space alert with Kafka's approach.

The Zookeeper disk space alert uses an absolute value (5GB) while Kafka uses a percentage. This inconsistency could cause issues with different volume sizes.

     - alert: ZookeeperRunningOutOfSpace
-      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
+      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 15
       for: 10s
       labels:
         severity: warning
       annotations:
         summary: 'Zookeeper is running out of free disk space'
-        description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
+        description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    - alert: ZookeeperRunningOutOfSpace
-      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
-      for: 10s
-      labels:
-        severity: warning
-      annotations:
-        summary: 'Zookeeper is running out of free disk space'
-        description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
+    - alert: ZookeeperRunningOutOfSpace
+      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 15
+      for: 10s
+      labels:
+        severity: warning
+      annotations:
+        summary: 'Zookeeper is running out of free disk space'
+        description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'

Comment on lines +12 to +19
    - alert: KafkaRunningOutOfSpace
      expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka is running out of free disk space'
        description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'

🛠️ Refactor suggestion

Adjust disk space alert threshold and duration.

The disk space alert might generate too many false positives:

  • A fixed 15% threshold might not suit large volumes, where 15% free space is still a lot of capacity
  • 10s duration is too short for disk space trends
     - alert: KafkaRunningOutOfSpace
       expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
-      for: 10s
+      for: 5m
       labels:
         severity: warning
       annotations:
         summary: 'Kafka is running out of free disk space'
         description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
📝 Committable suggestion


Suggested change
      # Current
      - alert: KafkaRunningOutOfSpace
        expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka is running out of free disk space'
          description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
      # Suggested
      - alert: KafkaRunningOutOfSpace
        expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Kafka is running out of free disk space'
          description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'

Comment on lines +20 to +59
      - alert: UnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under replicated partitions'
          description: 'There are {{ $value }} under replicated partitions on {{ $labels.pod }}'
      - alert: AbnormalControllerState
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka abnormal controller state'
          description: 'There are {{ $value }} active controllers in the cluster'
      - alert: OfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline partitions'
          description: 'One or more partitions have no leader'
      - alert: UnderMinIsrPartitionCount
        expr: kafka_server_replicamanager_underminisrpartitioncount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under min ISR partitions'
          description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.pod }}'
      - alert: OfflineLogDirectoryCount
        expr: kafka_log_logmanager_offlinelogdirectorycount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline log directories'
          description: 'There are {{ $value }} offline log directories on {{ $labels.pod }}'
Contributor

🛠️ Refactor suggestion

Increase alert durations for replication and partition issues.

The 10s duration for replication and partition alerts is too short and may cause alert noise. These issues often require time to stabilize after temporary blips.

Apply this change to all alerts in this range:

     - alert: UnderReplicatedPartitions
       expr: kafka_server_replicamanager_underreplicatedpartitions > 0
-      for: 10s
+      for: 2m
📝 Committable suggestion


Suggested change
      # Current
      - alert: UnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under replicated partitions'
          description: 'There are {{ $value }} under replicated partitions on {{ $labels.pod }}'
      - alert: AbnormalControllerState
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka abnormal controller state'
          description: 'There are {{ $value }} active controllers in the cluster'
      - alert: OfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline partitions'
          description: 'One or more partitions have no leader'
      - alert: UnderMinIsrPartitionCount
        expr: kafka_server_replicamanager_underminisrpartitioncount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under min ISR partitions'
          description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.pod }}'
      - alert: OfflineLogDirectoryCount
        expr: kafka_log_logmanager_offlinelogdirectorycount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline log directories'
          description: 'There are {{ $value }} offline log directories on {{ $labels.pod }}'
      # Suggested
      - alert: UnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under replicated partitions'
          description: 'There are {{ $value }} under replicated partitions on {{ $labels.pod }}'
      - alert: AbnormalControllerState
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka abnormal controller state'
          description: 'There are {{ $value }} active controllers in the cluster'
      - alert: OfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline partitions'
          description: 'One or more partitions have no leader'
      - alert: UnderMinIsrPartitionCount
        expr: kafka_server_replicamanager_underminisrpartitioncount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka under min ISR partitions'
          description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.pod }}'
      - alert: OfflineLogDirectoryCount
        expr: kafka_log_logmanager_offlinelogdirectorycount > 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'Kafka offline log directories'
          description: 'There are {{ $value }} offline log directories on {{ $labels.pod }}'

Comment on lines +256 to +263
      - alert: NoMessageForTooLong
        expr: changes(kafka_topic_partition_current_offset[10m]) == 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'No message for 10 minutes'
          description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
Contributor

🛠️ Refactor suggestion

Refine the no-message alert logic.

The current implementation might generate false positives for topics with naturally low message rates or during off-peak hours.

     - alert: NoMessageForTooLong
       expr: changes(kafka_topic_partition_current_offset[10m]) == 0
-      for: 10s
+      for: 30m
       labels:
         severity: warning
+        topic_pattern: '{{ $labels.topic }}'
       annotations:
         summary: 'No message for 10 minutes'
-        description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
+        description: 'No messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 30 minutes. Verify if this is expected for this topic pattern.'
✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Find topic patterns or naming conventions that might help identify topics with different message rate expectations
rg -g '*.{yaml,yml,json}' -i 'topic.*pattern|topic.*name' --no-filename | sort -u

Length of output: 7477


Refine No-Message Alert Logic

The suggested changes improve the alert in two important ways:

  • Extend the duration: Increasing the “for” duration from 10 seconds to 30 minutes reduces the risk of false positives in topics with naturally low or off-peak message rates.
  • Add topic-based filtering: Introducing the topic_pattern label (using the current topic label) aligns with existing conventions (as seen in other config fields like topicsPattern) and helps refine which topics should trigger the alert.

Please ensure that the alert’s summary and description consistently reflect the new duration (i.e., update “No message for 10 minutes” if the check duration is set to 30 minutes).
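For example, the annotations could be kept in sync with the longer duration (the wording below is illustrative only):

        annotations:
          summary: 'No messages for 30 minutes'
          description: 'No messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 30 minutes. Verify if this is expected for this topic pattern.'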

📝 Committable suggestion


Suggested change
      # Current
      - alert: NoMessageForTooLong
        expr: changes(kafka_topic_partition_current_offset[10m]) == 0
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: 'No message for 10 minutes'
          description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
      # Suggested
      - alert: NoMessageForTooLong
        expr: changes(kafka_topic_partition_current_offset[10m]) == 0
        for: 30m
        labels:
          severity: warning
          topic_pattern: '{{ $labels.topic }}'
        annotations:
          summary: 'No message for 10 minutes'
          description: 'No messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 30 minutes. Verify if this is expected for this topic pattern.'

"pluginVersion": "7.4.5",
"targets": [
{
"expr": "max(kafka_server_raftmetrics_current_state) by (kubernetes_pod_name, current_state)",
Member

Looks like the dashboard will also need fixing =(

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 12, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
dashboards/kafka/strimzi-kafka.json (1)

1-2941: 💡 Verification agent

🧩 Analysis chain

Verify KRaft mode compatibility.

Based on the PR comments, verify whether this dashboard supports monitoring KRaft-mode Kafka clusters, as the metrics may differ from ZooKeeper-based clusters.


🌐 Web query:

What metrics are different between ZooKeeper-based Kafka and KRaft mode Kafka?

💡 Result:

The shift from ZooKeeper-based Kafka to KRaft mode introduces significant differences in performance, scalability, and operational metrics. Here’s a breakdown of key metric variations:

Metadata Read Latency

KRaft reduces metadata read latency by storing state changes as ordered events in an internal Kafka topic (__cluster_metadata). Brokers recover by replaying events from their last offset instead of reloading state from ZooKeeper[1][4]. This lowers recovery latency during failures[1][5].

Maximum Partition Scalability

ZooKeeper-based clusters historically capped partitions at ~200k due to coordination bottlenecks between ZooKeeper and Kafka controllers[5]. KRaft eliminates this limitation by integrating metadata management directly into Kafka’s architecture, supporting up to 2 million partitions with faster controller failover[4][5].

Failover Delay

ZooKeeper’s message propagation inconsistencies often caused delayed failovers during leader elections[1]. KRaft’s Raft consensus protocol ensures faster leader election (near-instantaneous) and reduces unavailability windows during controller transitions[2][4]. Confluent benchmarks show 10x faster recovery times for clusters with 2M partitions under KRaft[5].

Throughput and Operational Efficiency

  • Metadata Propagation: ZooKeeper used RPCs for state synchronization, while KRaft leverages an event-driven log for efficient metadata replication[4][9].
  • Startup/Shutdown Times: KRaft avoids redundant state reloads during restarts since controllers retain in-memory metadata[2][4].
| Metric | ZooKeeper Mode | KRaft Mode |
| --- | --- | --- |
| Max Partitions | ~200k | 2M+ |
| Controller Failover | Seconds to minutes | Near-instantaneous |
| Metadata Storage | External (ZooKeeper) | Internal (__cluster_metadata) |
| Recovery Latency | Higher due to ZK sync | Lower via event replay [1][5] |
| Security Complexity | Dual models (ZK + Kafka) | Simplified single model [1][3] |

Operational Overhead

KRaft eliminates the need for separate ZooKeeper clusters, reducing deployment complexity (single vs. dual systems)[3][7]. Tools like kafka-topics shift from --zookeeper flags to --bootstrap-server endpoints[2][6].

For existing clusters migrating to KRaft, phased rollouts are required to transition metadata storage while maintaining dual-write compatibility until full cutover[3][6].

Citations:


KRaft Compatibility: Update Required for ZooKeeper-Specific Panels

  • The dashboard currently queries several ZooKeeper-specific metrics (e.g., the panel using kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms) that won’t be available in KRaft mode.
  • In KRaft mode, Kafka manages metadata internally (via the __cluster_metadata topic) and exhibits different metrics for controller behavior (e.g., reduced metadata read latency and faster failover). This suggests that some panels may return no data or misleading values when monitoring KRaft clusters.
  • Please verify that all panels relying on legacy ZooKeeper metrics have KRaft-compatible alternatives or include conditionals to handle both deployment modes (see the sketch after this list).
  • Consider updating the documentation to clarify that this dashboard is designed for ZooKeeper-based clusters unless the necessary modifications for KRaft mode are implemented.
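One possible direction, sketched below rather than a drop-in fix: pair each ZooKeeper-specific panel with a KRaft-mode equivalent built on the Raft metrics the dashboard already queries (kafka_server_raftmetrics_current_state). The current_state label values and any metric names beyond that one are assumptions and need to be verified against the JMX exporter rules:

      # ZooKeeper-mode panel query (returns no data on KRaft clusters):
      #   kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms
      # Possible KRaft-mode replacements, expressed as recording rules:
      - record: kafka:raft_current_state
        expr: max(kafka_server_raftmetrics_current_state) by (kubernetes_pod_name, current_state)
      - record: kafka:raft_leaders
        # Assumption: the exporter maps the Raft state into a current_state label
        # with values such as "leader" and "follower".
        expr: sum(kafka_server_raftmetrics_current_state{current_state="leader"})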
🧰 Tools
🪛 Biome (1.9.4)

[error] 2670-2670: The key pod was already declared.

This is where the duplicated key was declared again.

If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.

(lint/suspicious/noDuplicateObjectKeys)


[error] 2694-2694: The key pod was already declared.

This is where the duplicated key was declared again.

If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.

(lint/suspicious/noDuplicateObjectKeys)

🧹 Nitpick comments (4)
dashboards/kafka/strimzi-kafka.json (4)

18-22: Consider adding dashboard links for related dashboards.

The dashboard links array is empty. Consider adding links to related dashboards (e.g., Zookeeper, Kafka Connect) for better navigation between different components of the Kafka ecosystem.

  "links": [
+   {
+     "title": "Zookeeper Dashboard",
+     "type": "dashboard",
+     "tags": ["Strimzi", "Zookeeper"]
+   },
+   {
+     "title": "Kafka Connect Dashboard",
+     "type": "dashboard",
+     "tags": ["Strimzi", "Kafka Connect"]
+   }
  ],

2816-2818: Consider adjusting the refresh interval based on load.

The dashboard is set to refresh every 5 seconds, which might be too frequent for high-traffic clusters and could impact Prometheus performance.

-  "refresh": "5s",
+  "refresh": "30s",

2812-2812: Fix typo in panel title.

There's a typo in the ZooKeeper latency panel title.

-  "title": "ZK request latecy",
+  "title": "ZK Request Latency",

23-655: Consider adding alert thresholds for critical metrics.

The broker health metrics panel group provides essential monitoring but lacks alert thresholds for critical conditions. Consider adding alert rules for the following conditions (a minimal sketch follows this list):

  • Under-replicated partitions > 0
  • Offline partitions > 0
  • Active controllers != 1
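These conditions map directly onto expressions already shipped in packages/system/kafka-operator/templates/prometheus-rules.yaml (quoted earlier in this review); a minimal sketch, with the for: durations and severities as placeholders rather than the chart's actual values:

      - alert: UnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 2m
        labels:
          severity: warning
      - alert: OfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 2m
        labels:
          severity: warning
      - alert: AbnormalControllerState
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
        for: 2m
        labels:
          severity: warning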
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84a2394 and cc0222a.

📒 Files selected for processing (1)
  • dashboards/kafka/strimzi-kafka.json (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
dashboards/kafka/strimzi-kafka.json

[error] 2670-2670: The key pod was already declared.

This is where the duplicated key was declared again.

If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.

(lint/suspicious/noDuplicateObjectKeys)


[error] 2694-2694: The key pod was already declared.

This is where the duplicated key was declared again.

If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.

(lint/suspicious/noDuplicateObjectKeys)

🔇 Additional comments (2)
dashboards/kafka/strimzi-kafka.json (2)

1-17: LGTM! Well-structured dashboard annotations.

The dashboard annotations are properly configured with appropriate data source and visibility settings.


2823-2929: LGTM! Well-structured templating variables.

The templating variables are properly configured with appropriate data sources, regex patterns, and dependencies between variables. This allows for flexible filtering and drill-down capabilities.

Comment on lines +2674 to +2684
"pod": true,
"strimzi_io_broker_role": true,
"strimzi_io_cluster": true,
"strimzi_io_component_type": true,
"strimzi_io_controller": true,
"strimzi_io_controller_name": true,
"strimzi_io_controller_role": true,
"strimzi_io_kind": true,
"strimzi_io_name": true,
"strimzi_io_pod_name": true,
"strimzi_io_pool_name": true
Contributor

⚠️ Potential issue

Fix duplicate 'pod' key in transformations.

The transformations section contains duplicate 'pod' keys which could lead to unexpected behavior.

  "excludeByName": {
    "Time": true,
    "__name__": true,
    "container": true,
    "endpoint": true,
    "instance": true,
    "job": true,
    "pod": true,
    "namespace": true,
    "node_ip": true,
    "node": true,
-   "pod": true,
    "strimzi_io_broker_role": true,

Also applies to: 2694-2694

@lllamnyp lllamnyp merged commit ecfb02a into aenix-io:main Feb 13, 2025
1 check failed
Labels: enhancement (New feature or request), lgtm (This PR has been approved by a maintainer), size:XXL (This PR changes 1000+ lines, ignoring generated files)

4 participants