"no plugin instance" error appears when renewing the cache (KONG_DB_CACHE_TTL parameter) with custom go plugin #7148
Comments
Update: I have modified runloop/plugin_server/init.lua to print the contents of running_instances, and I get the following interesting log:
The sequence it is trying to find is exactly one less than the one currently stored in memory... The same behaviour happens for the other requests that fail. Maybe this can lead to something :D
During the time between the pod starting and KONG_DB_CACHE_TTL elapsing, only sequence 0 is used; I added a log every time it is used and I see only one log entry per worker:
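To make the observed off-by-one mismatch concrete, here is a toy Go model of a cache keyed by sequence number. The names (`runningInstances`, `getInstance`) are illustrative only and are not Kong's actual Lua data structures; this is just a sketch of how a lookup with a stale sequence fails the way the log above shows:

```go
package main

import "fmt"

// runningInstances maps a sequence number to a plugin instance id.
// Simplified stand-in for Kong's running_instances table (hypothetical shape).
var runningInstances = map[int]int{}

// getInstance looks up the instance id stored for a given sequence.
func getInstance(seq int) (int, error) {
	id, ok := runningInstances[seq]
	if !ok {
		return 0, fmt.Errorf("no plugin instance for sequence %d", seq)
	}
	return id, nil
}

func main() {
	// A cache refresh bumps the stored sequence from 0 to 1...
	runningInstances[1] = 42

	// ...but a request still asks for the old sequence 0,
	// which is exactly one less than the one kept in memory.
	if _, err := getInstance(0); err != nil {
		fmt.Println(err) // no plugin instance for sequence 0
	}
}
```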
Hi @ealogar, thanks for the detailed report. Can you tell us which version of go-pdk you are using? Is it the same across the different Kong versions?
Sure,
Note that the "sequence number" is a different concept from the "instance id". The first is used to detect changes or reloads from the database, while the second is used to synchronize between Kong and the Go plugin server. Either when there's a database event (like a modification from the admin API) or after KONG_DB_CACHE_TTL, the instance has to be recreated. I'm re-reviewing the code to ensure they're not mixed up somewhere. I see in your latest log there's no ... What happens if you touch the config via the admin API? Would it fail too?
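The distinction between the two counters can be sketched as follows. The struct and field names here are illustrative only, not Kong's actual types; the sketch just shows that the two values vary independently and must be kept in sync:

```go
package main

import "fmt"

// pluginState separates the two concepts described above:
// Sequence tracks configuration changes/reloads from the database,
// while InstanceID synchronizes Kong with the Go plugin server.
type pluginState struct {
	Sequence   int // bumped on DB events or after KONG_DB_CACHE_TTL
	InstanceID int // assigned by the external plugin server
}

func main() {
	s := pluginState{Sequence: 0, InstanceID: 7}

	// A config reload bumps the sequence; the old instance is now
	// stale and a fresh one must be requested from the plugin server.
	s.Sequence++
	s.InstanceID = -1 // -1 marks "needs a new instance" in this sketch

	fmt.Printf("seq=%d instance=%d\n", s.Sequence, s.InstanceID)
}
```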
I will test what you say later, but also keep in mind that the issue happens with 2.4.1 and it's always saying that the instance id is 0...
Look at the code at line 349 (in kong 2.3.3) @javierguerragiraldez
and the error message is
There is a capital N letter that prevents the next code from running...
🤦 I was sure it was a fragile test, but there wasn't any "clean" way to do it... Note that in 2.4 this check appears in both mp_rpc.lua and pb_rpc.lua. Can you try with ...
Give me a few hours to do the tests; I can't do it right now. I will let you know ASAP whether that fixes it, and I will open a PR as well.
After a couple of hours... the fix has worked!!! Currently I am making the change for 2.4.1 (I have to change 3 files) and will leave the load tests running all night. Tomorrow I will open a PR against master.
Great news! Thanks for your work, and sorry for the blunder.
Summary
Recently I have been trying to upgrade from Kong 2.0.4 to Kong 2.4 (and to Kong 2.3.3 later), and a very weird behaviour has appeared in a custom Go plugin (a global plugin called jwe, with no service or route directly linked). I have also migrated the plugin to the new standalone plugin server, which communicates with Kong directly via protobuf, without the go-pluginserver.
Everything has been done successfully.
This plugin decrypts a JWE (JSON Web Encryption) token from the Authorization header and propagates the result to the rest of the plugin chain (basically your jwt plugin and headers-transform). It's configured as the first one in the plugin chain.
When I run a JMeter test with a single thread, everything works fine for several hours; that is, the global plugin is executed correctly and the bearer token is decrypted and propagated to the jwt plugin.
But the problem arises some time after I run a JMeter test with several threads and a bit more load (50-100 threads).
We deploy the Kong docker image in a k8s pod deployment.
When the pod starts, everything works perfectly (at 500-800 tps), but after some time (a deterministic time, which I will explain later) some errors begin to appear:
The following errors appear in 2.3.3:
In Kong 2.4, the instance id is always 0:
The deterministic time when this begins to happen is exactly KONG_DB_CACHE_TTL. Initially I configured 1 hour, but after seeing this behaviour a few times during a day, my suspicion made me change it to 30 minutes, then 10 minutes, and repeat the test.
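For reference, the TTL in question is Kong's db_cache_ttl setting (in seconds), which can be set through the standard environment-variable mapping so the cache rebuild, and with it the error window, arrives sooner:

```shell
# Shorten the DB cache TTL to 10 minutes to reproduce the error faster.
# KONG_DB_CACHE_TTL maps to Kong's db_cache_ttl configuration (seconds).
export KONG_DB_CACHE_TTL=600
```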
When the pod starts, everything works well until the KONG_DB_CACHE_TTL time passes; then (I know you refresh services, routes and plugins in the cache to avoid hitting the database) the errors above begin to happen for a percentage of the requests (not all of them fail, but 10% of them or so)...
I imagine that when you rebuild the services/routes in the router cache, something gets corrupted in relation to the external plugin instances, and the bridge you use to communicate with the external plugins returns invalid values. I keep looking through the source code, but so far I don't have more details.
If I go to the pod and kill the plugin process directly, it is respawned correctly but the same error keeps happening.
Only a restart of the kong process seems to work.
This is causing 10% of the requests to skip the global plugin and terminate with a failure: they are correctly routed to the Lua plugins (jwt, for example), but they end with an error because the token is not decrypted...
The weird thing is that it does not happen for all requests; many of them are processed correctly by the plugin, and if you reduce the load (to 1 thread in JMeter) all requests work.
I have tested killing the Go server (with the plugin code); Kong quickly spawns a new process, but the error persists.
There may be a slight difference between preloading the services, routes and plugins during worker warmup and refreshing them later 🗡️
Steps To Reproduce
Additional Details & Logs
It may be related to #6500. @javierguerragiraldez, if you need additional details just tell me.
I am now testing a simple go-hello.go plugin to check whether I can reproduce the issue and rule out a failure in the jwe plugin. Since it works well until the cache reload happens, I suspect I will get the same result, but I wanted to let you know in advance.