Supporting exactly once #7522
Comments
A related issue -- using the user-supplied query_id as a deduplication key for inserts: #7461
From https://clickhouse.yandex/docs/en/operations/table_engines/replication/: "Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings." The blocks' hash sums are stored in ZooKeeper.
One important thing to note: deduplication happens at the shard level.
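To make the quoted behaviour concrete, here is a minimal sketch (not from the issue) of how a client can lean on this block-level deduplication today: the same INSERT payload is sent twice over the ClickHouse HTTP interface, and because the retried block is byte-identical, a `Replicated*MergeTree` table stores it only once. The host, table name, and data are placeholders.

```python
import requests  # third-party HTTP client, used here against the ClickHouse HTTP interface

CLICKHOUSE_URL = "http://localhost:8123/"   # placeholder host/port
TABLE = "events"                            # assumed Replicated*MergeTree table

# The payload must be byte-identical on retry: same rows, same order, same batch
# boundaries. Otherwise the block hash differs and deduplication does not apply.
PAYLOAD = "1\t2019-10-30\tclick\n2\t2019-10-30\tview\n"
QUERY = f"INSERT INTO {TABLE} FORMAT TabSeparated"

def insert_once(data: str) -> None:
    """Send one INSERT; safe to repeat with the exact same data after a network
    error, because ReplicatedMergeTree deduplicates identical blocks."""
    resp = requests.post(CLICKHOUSE_URL, params={"query": QUERY}, data=data.encode())
    resp.raise_for_status()

insert_once(PAYLOAD)
# Simulated retry after an ambiguous failure ("did the first request land?"):
insert_once(PAYLOAD)   # deduplicated server-side; the rows are not stored twice
```

Note that this only covers retries of identical blocks into replicated tables on the same shard; it is not the general, label-based exactly-once mechanism the issue asks for.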
Although the replicated engine can guarantee that a block is written exactly once, this is still different from real exactly-once semantics.
A new proposal to redesign the Kafka Engine could help resolve this issue too.
Done |
Currently, data updating is done through the `ReplacingMergeTree` engine. However, the update happens asynchronously in a background merge thread, and there are many cases where the business wants read-after-write semantics (or nearly so), which can only be achieved after `optimize final`. That is what is used right now, but it might cause the database to be blocked for a long time. On the other hand, when inserting data into ClickHouse, if the worker crashes and is restarted, duplication might happen unless `optimize final` is called.

For an OLAP solution such as Apache Doris, there is an important exactly-once feature that works together with Kafka: the so-called Doris Stream Load. Each Doris Stream Load HTTP request can carry a Label HTTP header, and Apache Doris guarantees that data under the same Label is loaded only once within 7 days (a configurable duration); duplicated insertions are reported as errors. As a result, if the Label for the Apache Doris load request (`dorisDb_dorisTable_sequence_id`) is strictly aligned with the offsets from Kafka, duplicate insertion from Kafka can be avoided.

If this feature were also implemented in ClickHouse, the data inconsistency issue could be greatly relieved.
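To illustrate the proposal, below is a rough Python sketch (not part of the issue) of the Doris-style flow it describes: a per-batch Label is derived from the Kafka topic, partition, and offset range, so retrying the same batch reuses the same Label and the server can reject it as a duplicate. The Stream Load endpoint (`PUT /api/{db}/{table}/_stream_load`), the `label` header, and all hosts, credentials, and table names are assumptions for illustration, not verified against any particular Doris version.

```python
import requests  # third-party HTTP client

DORIS_FE = "http://doris-fe:8030"      # placeholder frontend host
DB, TABLE = "dorisDb", "dorisTable"    # placeholder database/table names
AUTH = ("root", "")                    # placeholder credentials

def stream_load(rows: str, topic: str, partition: int,
                first_offset: int, last_offset: int) -> None:
    """Load one Kafka batch with a deterministic Label.

    Retrying the same offset range produces the same Label, so the server can
    report a duplicate instead of loading the data twice. The issue asks for an
    equivalent mechanism in ClickHouse.
    """
    label = f"{DB}_{TABLE}_{topic}_{partition}_{first_offset}_{last_offset}"
    resp = requests.put(
        f"{DORIS_FE}/api/{DB}/{TABLE}/_stream_load",
        headers={
            "label": label,              # deduplication key, kept for a configurable window
            "Expect": "100-continue",    # commonly required by the Stream Load protocol (assumed)
        },
        data=rows.encode(),
        auth=AUTH,
    )
    resp.raise_for_status()
    print(resp.text)                     # the server's load result, e.g. a "label already exists" error on retry

# Example: a batch covering offsets 100..199 of partition 0; a retry of the same
# range reuses the same Label and is rejected as a duplicate.
stream_load("1,click\n2,view\n", topic="events", partition=0,
            first_offset=100, last_offset=199)
```

The key design point is that the deduplication identity is derived from the Kafka offsets rather than from the data itself, which is what would make retries from a restarted worker safe without `optimize final`.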