DOC-3101: arch doc review/copy-edit
dwdougherty committed Nov 13, 2023
1 parent 14a5018 commit 711b8ed
Showing 1 changed file with 30 additions and 30 deletions.
60 changes: 30 additions & 30 deletions content/rdi/architecture.md
@@ -9,10 +9,10 @@ headerRange: "[2]"
aliases:
---

Redis Data Integration (RDI) is a product that helps Redis Enterprise users ingest and export data in near real time.
Redis Data Integration (RDI) is a product that helps Redis Enterprise users ingest and export data in near real time. Its features include:

- End to end solution; no need for additional tools and integrations
- Capture Data Change (CDC) included
- Being an end-to-end solution; no need for additional tools and integrations
- Capture Data Change (CDC) is included
- Covers most popular databases as sources and targets
- Declarative data mapping and transformations; no custom code needed
- Data delivery guaranteed (at least once)
@@ -27,50 +27,49 @@ RDI currently supports two use cases:

![RDI components](/images/rdi/rdi-components.png)

RDI has several components, all of them deployed outside of the Redis Enterprise cluster. In addition RDI uses a small Redis database inside the cluster for staging data and storing state.
RDI has several components, all deployed outside a Redis Enterprise cluster. In addition, RDI uses a small Redis database inside the cluster for staging data and storing state.

RDI can be deployed as a K8s deployment or on two VMs.
RDI can be deployed as a Kubernetes (K8s) deployment or on two VMs.

### RDI operator

RDI operator is the main control plane component. It is in charge of spinning, configuring and watching RDI data plane components.
RDI operator is the main control plane component. It is in charge of spinning up, configuring, and watching RDI data plane components.

### RDI collectors

For the ingest scenario RDI operator will create and configure a collector. RDI collector is in charge of fetching a baseline snapshot of the source data and then tracking the data changes at the source.
For the ingest use case, RDI operator will create and configure a collector, which is in charge of fetching a baseline snapshot of the source data and then tracking the data changes at the source.
Currently, RDI comes with one type of collector, the `RDI Debezium Collector`. This collector is an orchestrated [Debezium Server](https://debezium.io/).

### RDI stream processor

In all use cases, the RDI stream processor is the main processor of data:
In all use cases, the RDI stream processor is the main processor of data. Its functions are:

- Reading collector-supplied data from Redis streams.
- Applying transformations in order to translate the data from other models to Redis model (ingest) or vice versa from Redis to another model (write-behind).
- Applying transformations to translate the data from other data models to Redis data models (ingest), or vice versa, from Redis to another data model (write-behind).
- Connecting to the target database or datastore and applying the data changes.

For more information about RDI jobs and transformations, read the data transformation section of these docs.
For a list of targets see the specific ingest and write behind sections below.
For more information about RDI jobs and transformations, read the [data transformation]({{< relref "/rdi/data-transformation/" >}}) section.
For a list of targets see the specific ingest and write-behind sections below.

### RDI metrics exporter

The RDI metrics exporter is a prometheus exporter that allows [prometheus](https://prometheus.io/) to scrape metrics measuring data processing and performance.
The RDI metrics exporter is a Prometheus exporter that allows [Prometheus](https://prometheus.io/) to scrape metrics measuring data processing and performance.

### RDI API server

The RDI API server exposes REST endpoints of RDI API.
The RDI API server exposes REST endpoints of the RDI API.

### RDI CLI

The RDI CLI provides a user interface to manage RDI. It uses the RDI API.


## Ingest functionality and architecture

You can think of RDI Ingest as a streaming ELT process, where
You can think of RDI ingest as a streaming ELT process, where

- Data is **E**xtracted from the source database using RDI Debezium Collector - an orchestrated [Debezium Server](https://debezium.io/)
- Data is then **L**oaded into RDI DB, a Redis database instance that keeps the data in [Redis streams](https://redis.io/docs/manual/data-types/streams/) alongside required metadata.
- Data is then **T**ransformed using RDI Stream Processor and written to the target Redis database.
- Data is **E**xtracted from the source database using the RDI Debezium Collector, an orchestrated [Debezium Server](https://debezium.io/)
- Data is then **L**oaded into the RDI database, a Redis database instance that keeps the data in [Redis streams](https://redis.io/docs/manual/data-types/streams/) together with required metadata.
- Data is then **T**ransformed using the RDI Stream Processor and written to the target Redis database.
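The three ELT stages above can be illustrated with a toy, in-memory sketch. It does not use RDI, Debezium, or Redis: the `extract`, `load`, and `transform` functions, the event shape, and the `table:id:value` key format are all assumptions made up for illustration, with a `deque` standing in for a Redis stream and a `dict` standing in for the target database.

```python
from collections import deque


def extract(rows):
    """Stand-in for the collector: wrap each source row as a change event."""
    for row in rows:
        # Hypothetical event shape, chosen only for this sketch.
        yield {"op": "c", "table": "customers", "after": row}


def load(events, stream):
    """Stand-in for staging: append raw events to a stream in arrival order."""
    for event in events:
        stream.append(event)


def transform(stream, target):
    """Stand-in for the stream processor: map each event to a hash-like record."""
    while stream:
        event = stream.popleft()
        row = event["after"]
        # Assumed key format; a real deployment defines its own.
        key = f"{event['table']}:id:{row['id']}"
        target[key] = {k: str(v) for k, v in row.items()}


stream, target = deque(), {}
load(extract([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]), stream)
transform(stream, target)
print(target["customers:id:1"])  # {'id': '1', 'name': 'Ada'}
```

The point of the ordering is that data is staged in raw form first and only reshaped on its way to the target, which is what distinguishes ELT from ETL.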

RDI using Debezium Server works in two modes:

@@ -79,7 +78,6 @@ RDI using Debezium Server works in two modes:

![RDI data flow diagram](/images/rdi/rdi-ingest-data-flow.png)


### Supported data transformations

#### Model mapping
@@ -102,30 +100,32 @@ RDI supports declarative transformations to further manipulate the data, includi

### Secrets handling

RDI components requires access to secrets in order to access the source database (collector), the rdi database (collector, stream processor, operator, metrics exporter and API server) and the target database (stream processor).
RDI components require access to secrets in order to access the source database (collector), the rdi database (collector, stream processor, operator, metrics exporter, and API server) and the target database (stream processor).

RDI never keeps secrets in configuration files. Instead, secrets can be injected or pulled in the following ways:

[DWDOUGHERTY] MISSING CONTENT

### Scalability and high availability

RDI is highly available:
When deployed on Kubernetes, RDI components are all stateless pods managed by Kubernetes and the RDI operator.
When deployed on VMs, RDI use one VM as a failover (active-passive topology) with identical set of stateless components on each VM and an operator using a `Redlock` mechanism to ensure he owns the active set of RDI components.

All RDI state is stored in highly available manner using Redis database high availability and Kubernetes etcd.
- When deployed on Kubernetes, RDI components are all stateless pods managed by Kubernetes and the RDI operator.
- When deployed on VMs, RDI uses one VM as a failover (active-passive topology) with identical sets of stateless components on each VM, and an operator using a `Redlock` mechanism to ensure it owns the active set of RDI components.

All RDI state is stored in a highly available manner using Redis database high availability and Kubernetes **etcd**.

RDI is scalable:

- Data is distributed to streams based on number of tables or even based on primary key.
- During initial load (ingesting the baseline snapshot) RDI stream processor can span multiple processes each one processing some of the data streams.
- During initial load (ingesting the baseline snapshot), the RDI stream processor can span multiple processes, each one processing a subset of the data streams.
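As a rough illustration of distributing data to streams by primary key, the sketch below hashes the key to pick one of a fixed number of streams. This is not RDI's actual scheme: the `stream_for` function, the `data:<table>:<slot>` naming, and the use of CRC32 are assumptions chosen only to show the idea.

```python
import zlib


def stream_for(table, primary_key, partitions):
    """Pick a stream name for a change event (hypothetical naming scheme)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    slot = zlib.crc32(str(primary_key).encode("utf-8")) % partitions
    return f"data:{table}:{slot}"


first = stream_for("orders", 1001, 4)
second = stream_for("orders", 1001, 4)
print(first == second)  # True
```

Because a given key always maps to the same stream, changes to one row stay in order within that stream, while separate processes can each consume a different subset of streams.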

### Deployment

RDI can be deployed on Kubernetes or on VMs:

- On Kubernetes, RDI works as a Kubernetes deployment managed by RDI operator. RDI operator is a pod watched by the cluster and responsible for orchestrating the other RDI components.
- On VMs, RDI is deployed on two VMs. Each VM has an RDI operator that can orchestrate the other RDI components. At any given time a single RDI orchestrator is the primary and hence this VM is the active part of the deployment. On the other VM the orchestrator stops the RDI components and they do not run.
- On Kubernetes, RDI works as a Kubernetes deployment managed by RDI operator. RDI operator is a pod watched by the cluster and is responsible for orchestrating the other RDI components.
- On VMs, RDI is deployed on two VMs. Each VM has an RDI operator that can orchestrate the other RDI components. At any given time a single RDI orchestrator is the primary and its VM is the active part of the deployment. On the other VM, the orchestrator stops the RDI components and they do not run.
![rdi active passive](/images/rdi/rdi-active-passive.png)
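The active-passive behavior on VMs can be pictured as two nodes competing for a lease that expires unless it is renewed. The sketch below is a single-process simulation only: the `Lease` class, the node names, and the TTL value are hypothetical, and it is not an implementation of Redlock, which coordinates a lock across Redis instances.

```python
class Lease:
    """In-memory stand-in for a lock with a time-to-live (not Redlock itself)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        """Grant or renew the lease if it is free, expired, or already held by node."""
        if self.owner in (None, node) or now >= self.expires_at:
            self.owner = node
            self.expires_at = now + self.ttl
            return True
        return False


lease = Lease(ttl=10.0)
print(lease.acquire("vm-a", now=0.0))   # True: vm-a becomes active
print(lease.acquire("vm-b", now=5.0))   # False: vm-b stays passive
print(lease.acquire("vm-b", now=12.0))  # True: vm-a stopped renewing, vm-b takes over
```

Passing `now` explicitly keeps the example deterministic; a real operator would read a clock and renew the lease periodically while it remains healthy.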

## Write-behind functionality and architecture
Expand All @@ -140,10 +140,10 @@ To learn more about write-behind declarative jobs and normalization, see the [wr

RDI's CLI and components come in one edition that can run both ingest and write-behind. However, the topology for write-behind is different.

- RDI Redis collector - This component run on RedisGears inside the application Redis database. It captures data change events and writes them into Redis streams.
- RDI stream processor - reads and process the data in the streams the same way it does for ingest data, however there are 2 main differences:
- RDI needs a job to be deployed in order to know how to map the data to a target table(s)
- RDI will use a specific writer (In most cases `relational.write`) in order to connect and apply the changes to the target.
- RDI Redis collector - This component runs on RedisGears inside the application Redis database. It captures data change events and writes them into Redis streams.
- RDI stream processor - reads and processes the data in the streams the same way it does for ingest data. However, there are 2 main differences:
- RDI needs a job to be deployed in order to know how to map the data to target tables.
- RDI will use a specific writer (`relational.write` in most cases) in order to connect and apply the changes to the target.

![write behind components](/images/rdi/rdi-write-behind.png)
