Refresh architecture doc (#1022)
* Refresh architecture doc

* Add CONTRIBUTING

* Update docs/Architecture.md

Co-authored-by: Julien Pinsonneau <[email protected]>

* Update docs/Architecture.md

Co-authored-by: Julien Pinsonneau <[email protected]>

* More details on CLI modes

---------

Co-authored-by: Julien Pinsonneau <[email protected]>
jotak and jpinsonneau authored Jan 20, 2025
1 parent 4859d5a commit e4ea8aa
Showing 5 changed files with 101 additions and 49 deletions.
3 changes: 3 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,3 @@
## Contributing

Please refer to [NetObserv projects contribution guide](https://github.com/netobserv/documents/blob/main/CONTRIBUTING.md).
147 changes: 98 additions & 49 deletions docs/Architecture.md
@@ -1,49 +1,98 @@
# Network Observability Architecture

The Network Observability solution consists of a [Network Observability Operator (NOO)](https://github.com/netobserv/network-observability-operator)
that deploys, configures and controls the status of the following components:

* [Network Observability eBPF Agent](https://github.com/netobserv/netobserv-ebpf-agent/)
* It is attached to all the interfaces in the host network and listens for each network packet that
is sent or received through their egress/ingress. The agent aggregates the packets by source
and destination addresses, protocol, etc. into network flows that are submitted to the
Flowlogs-Pipeline flow processor.
* [Network Observability Flowlogs-Pipeline (FLP)](https://github.com/netobserv/flowlogs-pipeline)
* It receives the raw flows from the agent, decorates them with Kubernetes information (Pod
and host names, namespaces, etc.), and stores them as JSON in a [Loki](https://grafana.com/oss/loki/)
instance.
* [Network Observability Console Plugin](https://github.com/netobserv/network-observability-console-plugin)
* It is attached to the OpenShift console as a plugin (see Figure 1, though it can also be
deployed in standalone mode). The Console Plugin queries the flow information stored in Loki
and allows filtering flows, showing network topologies, etc.

![Netobserv frontend architecture](./assets/frontend.png)
Figure 1: Console Plugin deployment

There are two existing deployment modes for Network Observability: direct mode and Kafka mode.

## Direct-mode deployment

In direct mode (figure 2), the eBPF agent sends the flow information to Flowlogs-Pipeline encoded as Protocol
Buffers (binary representation) via [gRPC](https://grpc.io/). In this scenario, Flowlogs-Pipeline
is usually deployed as a DaemonSet, so there is 1:1 communication between the Agent and FLP internal
to the host, which minimizes cluster network usage.

![Netobserv component's architecture (direct mode)](./assets/architecture-direct.png)
Figure 2: Direct deployment

## Kafka-mode deployment

In Kafka mode (figure 3), the communication between the eBPF agent and FLP is done via a Kafka topic.

![Netobserv component's architecture (Kafka mode)](./assets/architecture-kafka.png)
Figure 3: Kafka deployment

This has some advantages over the direct mode:
1. Flows are buffered in the Kafka topic, so if there is a peak of flows, FLP can still
receive and process them without any kind of denial of service.
2. Flows are persisted in the topic, so if FLP is restarted for any reason (a configuration
update or just a crash), the forwarded flows remain in Kafka for later processing, and we
don't lose them.
3. By deploying FLP as a Deployment, you don't have to keep the 1:1 proportion: you can scale
FLP pods up and down according to your load.
# NetObserv architecture

_See also: [architecture in the downstream documentation](https://docs.openshift.com/container-platform/latest/observability/network_observability/understanding-network-observability-operator.html#network-observability-architecture_nw-network-observability-operator)_

NetObserv is a collection of components that can sometimes run independently, or as a whole.

The components are:

- An [eBPF agent](https://github.com/netobserv/netobserv-ebpf-agent), that generates network flows from captured packets.
- It is attached to any/all of the network interfaces in the host, and listens for packets (ingress+egress) with [eBPF](https://ebpf.io/).
- Packets are aggregated into logical flows (similar to NetFlows), periodically exported to a collector, generally FLP.
- Optional features can add rich data, such as TCP latency or DNS information.
- It is able to correlate those flows with other events such as network policy rules and drops (network policy correlation requires the [OVN Kubernetes](https://github.com/ovn-org/ovn-kubernetes/) network plugin).
- When used with the CLI or as a standalone, the agent can also do full packet captures instead of generating logical flows.
- [Flowlogs-pipeline](https://github.com/netobserv/flowlogs-pipeline) (FLP), a component that collects, enriches and exports these flows.
- It uses Kubernetes informers to enrich flows with details such as Pod names, namespaces, availability zones, etc.
- It derives metric counters from all flows, for Prometheus.
- Raw flows can be exported to Loki and/or custom exporters (Kafka, IPFIX, OpenTelemetry).
- As a standalone, FLP is very flexible and configurable. It supports more inputs and outputs, allows more arbitrary filters, sampling, aggregations, relabelling, etc. When deployed via the operator, only a subset of its capabilities is used.
- [A Console plugin](https://github.com/netobserv/network-observability-console-plugin) for flow visualization, with powerful filtering options, a topology representation and more. When used in OpenShift, it plugs into the web console (outside of OpenShift, [it can be deployed as a standalone](https://github.com/netobserv/network-observability-operator/blob/main/FAQ.md#how-do-i-visualize-flows-and-metrics)).
- It provides a polished web UI to visualize and explore the flow logs and metrics stored in Loki and/or Prometheus.
- Different views include a metrics overview, a network topology and a table listing raw flow logs.
- It supports multi-tenant access, making it relevant for various use cases: cluster/network admins, SREs, development teams...
- [An operator](https://github.com/netobserv/network-observability-operator) that manages all of the above.
- It provides two APIs (CRDs): one called [FlowCollector](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md), which configures and pilots the whole deployment, and another called [FlowMetric](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowMetric.md), which lets you customize which metrics to generate from flow logs (an illustrative sketch follows this list).
- As an [OLM operator](https://olm.operatorframework.io/), it is designed with `operator-sdk`, and allows subscriptions for easy updates.
- [A CLI](https://github.com/netobserv/network-observability-cli) that also manages some of the above components, for on-demand monitoring and packet capture.
- It is provided as a `kubectl` or `oc` plugin, allowing you to capture flows (similar to what the operator does, except it's on-demand and in the terminal), full packets (much like a `tcpdump` command) or metrics.
- It is also available via [Krew](https://krew.sigs.k8s.io/).
- It offers a live visualization via a TUI. For metrics, when used in OpenShift, it provides out-of-the-box dashboards.
- Check out the blog post: [Network observability on demand](https://developers.redhat.com/articles/2024/09/17/network-observability-demand#what_is_the_network_observability_cli_).
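
To make the FlowMetric API more concrete, here is a minimal sketch of such a resource. Only the API name comes from the list above; the field names and values shown (`metricName`, `valueField`, `labels`, `filters`, the namespace) are illustrative assumptions and should be checked against the FlowMetric reference linked in the list.

```yaml
# Hypothetical FlowMetric sketch: generates a Prometheus counter from flow logs,
# restricted to flows towards a given namespace.
# Field names are illustrative; refer to the FlowMetric reference for the exact schema.
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: example-ingress-bytes
  namespace: netobserv
spec:
  metricName: example_ingress_bytes_total      # name of the generated Prometheus metric
  type: Counter                                 # counter incremented from flow values
  valueField: Bytes                             # flow field used as the counter increment
  labels: [DstK8S_Namespace, DstK8S_OwnerName]  # flow fields promoted as metric labels
  filters:
    - field: DstK8S_Namespace                   # only count flows towards this namespace
      matchType: Equal
      value: my-namespace
```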

## Direct deployment model

When using the operator with `FlowCollector` `spec.deploymentModel` set to `Direct`, agents and FLP are both deployed per node (as `DaemonSets`). This is perfect for assessing the technology and suitable for small clusters, but it isn't very memory efficient in large clusters, as every FLP instance ends up caching the same cluster information, which can be huge.

Note that Loki isn't managed by the operator and must be installed separately, such as with the Loki operator. The same goes for Prometheus and any custom receiver.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
subgraph "for each node"
A[eBPF Agent] -->|generates flows| F[FLP]
end
F -. exports .-> E[(Kafka/Otlp/IPFIX)]
F -->|raw logs| L[(Loki)]
F -->|metrics| P[(Prometheus)]
C[Console plugin] <-->|fetches| L
C <-->|fetches| P
O[Operator] -->|manages| A
O -->|manages| F
O -->|manages| C
```
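
As a minimal sketch, a `Direct` deployment could be requested with a resource like the following. Apart from `spec.deploymentModel`, the fields shown are illustrative assumptions; check the FlowCollector reference linked above for the exact schema and defaults.

```yaml
# Minimal FlowCollector sketch for the Direct model (illustrative fields).
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster              # the FlowCollector resource is commonly a singleton named "cluster"
spec:
  deploymentModel: Direct    # agents and FLP deployed together, per node
  agent:
    type: eBPF
  loki:
    enable: true             # Loki itself must be installed separately (e.g. with the Loki operator)
```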

## Kafka deployment model

When using the operator with `FlowCollector` `spec.deploymentModel` set to `Kafka`, only the agents are deployed per node as a `DaemonSet`. FLP becomes a Kafka consumer that can be scaled independently. This is the recommended mode for large clusters, and is a more robust/resilient solution.

As in the `Direct` model, the data stores aren't managed by the operator; the same applies to the Kafka brokers and their storage. You can check out the Strimzi operator for that.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
subgraph "for each node"
A[eBPF Agent]
end
A -->|produces flows| K[(Kafka)]
F[FLP] <-->|consumes| K
F -. exports .-> E[(Kafka/Otlp/IPFIX)]
F -->|raw logs| L[(Loki)]
F -->|metrics| P[(Prometheus)]
C[Console plugin] <-->|fetches| L
C <-->|fetches| P
O[Operator] -->|manages| A
O -->|manages| F
O -->|manages| C
```
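
Similarly, here is a hedged sketch of a `FlowCollector` using the Kafka model. The `spec.kafka` values below (broker address, topic name, consumer replicas) are placeholders, and the brokers themselves must be deployed separately, for example with Strimzi.

```yaml
# Minimal FlowCollector sketch for the Kafka model (illustrative fields).
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Kafka       # agents produce to Kafka, FLP consumes from it
  kafka:
    address: my-kafka-bootstrap.netobserv:9092   # placeholder broker address (not managed by the operator)
    topic: network-flows                         # placeholder topic name
  processor:
    kafkaConsumerReplicas: 3   # FLP consumers scale independently of the number of nodes
```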

## CLI

When using the CLI, the operator is not involved, which means you can use it without installing NetObserv as a whole. It uses a special mode of the eBPF agents that embeds FLP.

When running flows or packet capture, a collector Pod is deployed in addition to the agents. When capturing only metrics, the collector isn't deployed, and metrics are exposed directly from the agents, pulled by Prometheus.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
subgraph "for each node"
A[eBPF Agent w/ embedded FLP]
end
A -->|generates flows or packets| C[Collector]
CL[CLI] -->|manages| A
CL -->|manages| C
A -..->|metrics| P[(Prometheus)]
```
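
As a usage sketch, the CLI is typically invoked as shown below. The exact subcommands and flags may vary between versions, and the Krew plugin name is assumed here, so treat these as examples and check the CLI repository or its built-in help rather than relying on them.

```sh
# Install the plugin via Krew (plugin name assumed; see the CLI repository for exact instructions)
kubectl krew install netobserv

# Capture flows on demand and browse them live in the terminal TUI
oc netobserv flows

# Capture full packets, tcpdump-style
oc netobserv packets

# Expose metrics from the agents (no collector Pod), to be scraped by Prometheus
oc netobserv metrics

# Remove the capture components when done
oc netobserv cleanup
```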
Binary file removed docs/assets/architecture-direct.png
Binary file removed docs/assets/architecture-kafka.png
Binary file removed docs/assets/frontend.png
