
fix(plugin): fix batch queue closure function update bug #10090

Closed
wants to merge 11 commits from fix/batch_queue_conf

Conversation

oowl
Member

@oowl oowl commented Jan 10, 2023

Summary

The Kong opentelemetry plugin uses a closure here: https://github.com/Kong/kong/blob/master/kong/plugins/opentelemetry/handler.lua#L151. This causes config updates to not take effect, because the closure captures the config only once, when it is created.
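A minimal sketch of the failure mode (not the actual handler code; the queue table, `get_queue`, and `endpoint` field are hypothetical stand-ins): a closure created alongside the queue pins the first `conf` it sees, so a later config update is never observed as long as the queue is reused.

```lua
-- Hypothetical illustration of the bug: `process` captures `conf`
-- once, at queue-creation time, and keeps using it forever.
local queues = {}

local function get_queue(conf)
  local id = conf.__key__
  if not queues[id] then
    local captured = conf  -- captured exactly once, here
    queues[id] = {
      process = function(entries)
        -- always sees the original config, even after an update
        return captured.endpoint, entries
      end,
    }
  end
  return queues[id]
end
```

Because `__key__` is unchanged across the update, the second call returns the stale queue and the closure still reports the old endpoint.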

Full changelog

Issue reference

FTI-4643

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 10, 2023
@oowl oowl force-pushed the fix/batch_queue_conf branch from 6009e7e to 3b28b81 Compare January 11, 2023 08:47
@oowl oowl added this to the 3.2 milestone Jan 11, 2023
@oowl oowl force-pushed the fix/batch_queue_conf branch from 3b28b81 to 39ac379 Compare January 11, 2023 08:57
@oowl oowl force-pushed the fix/batch_queue_conf branch from 39ac379 to f52bfc8 Compare January 11, 2023 08:57
@oowl oowl requested a review from hanshuebner January 11, 2023 10:02
@@ -120,6 +122,13 @@ function DatadogHandler:log(conf)

local queue_id = get_queue_id(conf)
local q = queues[queue_id]
if q and not tx_deepcompare(conf, q.conf) then
Contributor

Have we considered using a CRUD event handler to flush the queue upon plugin config change, instead of doing a deep compare on every request?
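The suggestion above can be sketched with a tiny self-contained event bus (the `register`/`post` functions and the `"plugins:update"` event name are hypothetical stand-ins for Kong's worker-events machinery, not its real API):

```lua
-- Hypothetical sketch: flush and drop a queue from a single config-update
-- event handler, instead of deep-comparing conf on every logged request.
local queues = {}
local handlers = {}
local flush_log = {}

local function register(event, fn) handlers[event] = fn end
local function post(event, data)
  local fn = handlers[event]
  if fn then fn(data) end
end

register("plugins:update", function(data)
  local q = queues[data.key]
  if q then
    flush_log[#flush_log + 1] = data.key  -- stand-in for q.queue:flush()
    queues[data.key] = nil                -- next request recreates the queue
  end
end)

queues[4] = { conf = { endpoint = "http://old" } }
post("plugins:update", { key = 4 })
-- queues[4] is now nil, and the flush was recorded exactly once
```

The per-request cost drops to nothing: the comparison work happens only when an update event actually fires.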

Contributor

It also seems to be imprudent to copy and paste this logic into every plugin that uses queues.

@@ -153,6 +155,14 @@ function HttpLogHandler:log(conf)

local queue_id = get_queue_id(conf)
local q = queues[queue_id]
if q and not tx_deepcompare(conf, q.conf) then
local flush_res, err = q.queue:flush()
Contributor

@hanshuebner hanshuebner Jan 13, 2023

In what circumstance would the configuration change in ways that retains the queue key but still requires flushing? In other words, would a configuration change not normally recreate the queue anyway? Is there test coverage for this part of this code?

Member

@kikito kikito Jan 13, 2023

FTI-4643 shows that the problem was detected in the opentelemetry plugin, not in http-log.

I believe the problem is in how queue_id is calculated.

On http-log, get_queue_id is calculated using "all the values of the config":

local function get_queue_id(conf)
  return fmt("%s:%s:%s:%s:%s:%s:%s:%s",
             conf.http_endpoint,
             conf.method,
             conf.content_type,
             conf.timeout,
             conf.keepalive,
             conf.retry_count,
             conf.queue_size,
             conf.flush_timeout)
end

In other plugins we are returning __key__. In Datadog:

local function get_queue_id(conf)
  return conf.__key__
end

In opentracing:

local queue_id = conf.__key__

On http-log we essentially "serialize all the config" already in get_queue_id. If the config changes, queue_id will always change.

On the others we use conf.__key__. The plugins iterator will set that when iterating over a plugin for the first time (probably at creation time). It seems it is just an integer.

That means that an "old" version of opentelemetry may have the same "number" as a new version of the plugin with a different config. If the new config is loaded via a reload, the Lua state will not be recreated, so queue[4] will be kept the same.

My recommendation is: let's not use __key__ and instead use something similar to what http-log is doing. I am not 100% sure that this is the solution, so I will leave this as a comment, not as a change request.
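The difference described above can be made concrete with a small sketch (the two-field `config_queue_id` and the sample conf tables are simplified illustrations, not the plugins' real code): a config-derived id changes whenever the config changes, while `__key__` stays fixed across a reload, so the stale queue keeps matching.

```lua
-- Hypothetical illustration of the two queue-id strategies.
local fmt = string.format

-- http-log style: the id is serialized from the config values
local function config_queue_id(conf)
  return fmt("%s:%s", conf.http_endpoint, conf.queue_size)
end

-- same plugin instance (__key__ = 4) before and after a config update
local old_conf = { __key__ = 4, http_endpoint = "http://a", queue_size = 10 }
local new_conf = { __key__ = 4, http_endpoint = "http://b", queue_size = 10 }

-- config-derived ids differ, so a fresh queue would be created;
-- __key__-based ids are equal, so the stale queue (and its closure) is reused
```

This is why http-log happened to dodge the bug: any config change perturbs its queue id as a side effect of the serialization.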

Contributor

If __key__ remains in use, it will be easier to see what's going on if the get_queue_id function is replaced by queue_id = conf.__key__, as in the opentracing plugin. I was misled by my reading of the http-log plugin, which uses some of the config attributes to construct the queue id.

Member Author

Yes, @kikito's meaning is the same as mine.
@hanshuebner For http-log, I think this implementation has some problems: it uses all config fields as the key to store the queue, so whenever we add a new config field we also have to add it to the batch-queue key function, which is easy to forget. Also, string concatenation is very expensive in Lua. I think we need a new way to solve this problem; I will use the CRUD event handler to flush the queue.

Contributor

I don't think that the string concatenation required to yield the queue id is contributing a lot to the overall performance, but I'll gladly see my scepticism be removed with some real numbers. My concern is mostly that get_queue_id, as it is defined in the http-log plugin, does something that is very different from what get_queue_id does in the opentracing plugin. The reuse of the (bad) function name in the http-log plugin led me to believe that a similar thing is done in the opentracing plugin. By removing the function and instead just using the __key__ field of the configuration, such confusion could be removed.

Member Author

@hanshuebner I think using the plugin update event is a better way to flush the batch queue. What do you think?

Contributor

If I'm not misinformed, the only way to get a "plugin update event" is to compare the current configuration with the previous configuration in the access function. I think flushing the queue should only be done if queue parameters relevant to the upstream server change. It would be surprising and unwanted to automatically flush the queues on all configuration changes.

Member Author

Sorry, I was wrong about the string cost: the string.format function is not that expensive; it is the "a" .. "b" style of repeated string concatenation that is expensive.
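The distinction above can be shown with a small sketch (the `parts` table is an invented example): repeatedly appending with `..` in a loop allocates a new intermediate string on every step, whereas `table.concat` (or a single `string.format` call) builds the result in one pass.

```lua
-- Illustration: loop-wise `..` creates one intermediate string per step;
-- table.concat produces the same result with a single allocation.
local parts = { "http://endpoint", "POST", "json", "10", "2" }

-- quadratic-ish: each iteration copies the growing accumulator
local slow = ""
for _, p in ipairs(parts) do
  slow = slow .. p .. ":"
end

-- one pass over the parts
local fast = table.concat(parts, ":") .. ":"
```

Note that a single expression like `a .. ":" .. b`, or one `string.format` call as in get_queue_id, concatenates all operands at once and does not suffer from the loop-wise cost.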

Member Author

@oowl oowl Jan 14, 2023

A worker "update" event is triggered when this plugin's configuration is updated, so I think that is where the plugin's queue should be flushed: 9ba3603#diff-1ff9a05ec152f46da66c4fc78f1d1f0a9e1604ed0bf530a71a6d4e9323fc3436R157

Contributor

Flushing the queues on arbitrary reconfigure events is not acceptable.

I do not agree with these changes at this point, but I will not continue reviewing until Monday.

@oowl oowl force-pushed the fix/batch_queue_conf branch 2 times, most recently from 154e3fc to 9ba3603 Compare January 14, 2023 17:25
@oowl oowl force-pushed the fix/batch_queue_conf branch from 9ba3603 to 39ac379 Compare January 15, 2023 02:29
@hanshuebner
Contributor

@dndx and I have met to discuss the need to redesign the queuing system, which will be happening in the next couple of weeks. With the redesign, the fix for this FTI would look different enough to make it questionable whether we want to roll it out like this. In particular, I have these objections:

  • A lot of code is copy&pasted between different plugins, with slight differences in semantics
  • Data is potentially lost when a configuration change is detected and queues are silently flushed
  • A full deep compare of the plugin configuration is done for each logged request

As indicated in FTI-4643, the customer currently knows a workaround to the actual problem they're seeing, so it seems not to be an urgent issue.

@oowl
Member Author

oowl commented Jan 16, 2023

@hanshuebner I am very glad to see the new design; please assign me when you have the results. For now, this PR is on hold, and I will remove it from the 3.2 milestone.

@oowl oowl removed this from the 3.2 milestone Jan 16, 2023
@RobSerafini RobSerafini added this to the 3.2 milestone Jan 19, 2023
@hanshuebner hanshuebner removed this from the 3.2 milestone Jan 19, 2023
@oowl
Member Author

oowl commented Feb 1, 2023

It seems #10172 can address this problem, so this PR can be closed.

@oowl oowl closed this Feb 1, 2023
@hbagdi hbagdi deleted the fix/batch_queue_conf branch March 22, 2023 22:15
curiositycasualty pushed a commit that referenced this pull request Oct 15, 2024
…o `0.1.2137` in `/spec-ee/kong-api-tests` (#10090)

Bumps [@kong/runtime-groups-api-client](https://github.com/Kong/khcp/tree/HEAD/packages/khcp-api-client) from 0.1.2111 to 0.1.2137.
- [Commits](https://github.com/Kong/khcp/commits/HEAD/packages/khcp-api-client)

---
updated-dependencies:
- dependency-name: "@kong/runtime-groups-api-client"
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Hayk <[email protected]>
6 participants