
"no plugin instance" error appears when renewing the cache (KONG_DB_CACHE_TTL parameter) with a custom Go plugin #7148

Closed
ealogar opened this issue May 12, 2021 · 11 comments · Fixed by #7153

Comments

@ealogar
Contributor

ealogar commented May 12, 2021

Summary

Recently I have been trying to upgrade from Kong 2.0.4 to Kong 2.4 (and later Kong 2.3.3), and I ran into a very weird behaviour in a custom Go plugin (a global plugin, called jwe, not linked directly to any service or route). I have also migrated the plugin to the new standalone plugin server that communicates with Kong directly over protobuf, without go-pluginserver.
Everything was done successfully.
This plugin decrypts an Authorization header containing a JWE (JSON Web Encryption token) and propagates the result to the rest of the plugin chain (basically your jwt plugin and headers-transform). It is configured as the first plugin in the chain.

When I run a JMeter test with a single thread, everything works fine for several hours; that is, the global plugin is executed correctly and the bearer token is decrypted and propagated to the jwt plugin.

But the problem arises some time after I run a JMeter test with several threads and a bit more load (50-100 threads).

We deploy the Kong Docker image as a k8s pod deployment.

When the pod is started, everything works perfectly (at 500-800 tps), but after some time (a deterministic time, as I will explain later) some errors begin to appear.

The following errors appear in 2.3.3:

2021/05/12 13:54:53 [error] 27#0: *253919 [kong] init.lua:332 [jwe] no plugin instance 123499, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"
2021/05/12 13:54:53 [error] 26#0: *253921 [kong] init.lua:332 [jwe] no plugin instance 123514, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

In kong 2.4, the instance id is always 0:

2021/05/12 11:52:21 [error] 28#0: *138170 [kong] mp_rpc.lua:308 [jwe] no plugin instance 0, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"
2021/05/12 11:52:21 [error] 28#0: *138723 [kong] mp_rpc.lua:308 [jwe] no plugin instance 0, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

The deterministic time at which this begins to happen is exactly KONG_DB_CACHE_TTL. Initially I configured 1 hour, but after a day of seeing this behaviour my suspicion led me to change it to 30 minutes, then 10 minutes, and repeat the test.
When the pod starts, everything works well until KONG_DB_CACHE_TTL has passed; then (I know you refresh services, routes and plugins in the cache to avoid hitting the database) the errors above begin to happen for a percentage of the requests (not all of them fail, but around 10%).

I imagine that when you rebuild the services/routes in the cached router, something related to the external plugin instances gets corrupted, and the bridge used to communicate with the external plugins returns invalid values. I keep looking in the source code, but so far I don't have more details.

If I go to the pod and kill the plugin process directly, it is respawned correctly, but the same error keeps happening.
Only a restart of the Kong process seems to fix it.

This causes around 10% of the requests to skip the global plugin and end in failure: they are correctly handed off to the Lua plugins (jwt, for example), but they end with an error because the token was not decrypted.
The weird thing is that it does not happen for all the requests; many of them are processed correctly by the plugin, and if you reduce the load (1 thread in JMeter) all requests work.
I have tried killing the Go server (the one running the plugin code); Kong quickly spawns a new process, but the error persists.

There may be a slight difference between preloading the services, routes and plugins during the worker warmup and refreshing them later 🗡️

Steps To Reproduce

  1. Run the Kong Docker image with version 2.4, 2.4.1 or 2.3.3 and configure KONG_DB_CACHE_TTL to a deterministic value (1 hour, 30 minutes, 10 minutes...)
  2. Provision 300-400 services and routes
  3. Configure a Go plugin globally (without go-pluginserver) and the jwt and request-transformer plugins for each service.
  4. Run JMeter with 50 threads / 1 thread.

Additional Details & Logs

It may be related to #6500. @javierguerragiraldez, if you need additional details just tell me.

I am now testing a simple go-hello.go plugin to check whether I can reproduce the problem and rule out a failure in the jwe plugin. Since everything works well before the cache reload happens, I suspect I will get the same result, but I wanted to let you know in advance.

@ealogar
Contributor Author

ealogar commented May 12, 2021

Update: I have modified runloop/plugin_servers/init.lua to print out running_instances, and I got the following interesting log:

2021/05/12 16:10:35 [error] 26#0: *85638 [kong] init.lua:333 [jwe] no plugin instance 8996, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "api.default-ci.eag.baikalplatform.es"
2021/05/12 16:10:35 [notice] 26#0: *85638 [kong] init.lua:334 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    Config = {
      DBHost = "pgbouncer",
      DBName = "apigw",
      DBPassword = "zzzz",
      DBPort = 5432,
      DBUser = "apigw_eag_",
      PostBasePathRegexp = "/token-introspection/v[0-9]+",
      PostParam = "token"
    },
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 70752,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "zzzz",
      dbport = 5432,
      dbuser = "apigw_eag_",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    id = 8997,
    rpc = {
      msg_id = 248479,
      notifications_callbacks = {
        serverPid = <function 1>
      },
      pluginserver_pid = 32,
      socket_path = "/usr/local/kong/jwe.socket",
      <metatable> = <1>{
        __index = <table 1>,
        call = <function 2>,
        fix_mmap = <function 3>,
        new = <function 4>,
        notification = <function 5>,
        notifications_callbacks = {}
      }
    },
    seq = 70752
  }
}, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

The instance id it is trying to find (8996) is exactly one less than the one currently stored in memory (8997)... The same behaviour happens for the other requests that fail.

Maybe this can lead to something :D
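
For reference, the extra logging I describe above was roughly of this shape; this is only a sketch, not the actual diff, and the use of Kong's bundled inspect module to serialize the table is my own choice:

  local inspect = require "inspect"

  -- dump the missing id and the whole per-worker table right where the
  -- "no plugin instance" error is raised, then let the normal error path run
  kong.log.err("no plugin instance ", instance_id)
  kong.log.notice(inspect(running_instances))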

@ealogar
Contributor Author

ealogar commented May 12, 2021

Between the time the pod starts and the moment KONG_DB_CACHE_TTL expires, only sequence 0 is used. I have added logs every time it is used and I see only one log entry per worker:

2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:83 [jwe] Adding a new instance as we are the first, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"
2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:84 [jwe] {}, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"
2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:86 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 0,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "xxxx",
      dbport = 5432,
      dbuser = "apigw_eag",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    seq = 0
  }
}, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"

10.240.0.228 - - [12/May/2021:16:31:26 +0000] "GET /jwe/healthz HTTP/1.1" 200 14 "-" "kube-probe/1.20" "-" "-"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:83 [jwe] Adding a new instance as we are the first, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:84 [jwe] {}, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:86 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 0,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "xxx",
      dbport = 5432,
      dbuser = "apigw_eag",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    seq = 0
  }
}, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"

@javierguerragiraldez
Contributor

Hi @ealogar, thanks for the detailed report. Can you tell us which version of go-pdk you are using? Is it the same on the different Kong versions?

@ealogar
Contributor Author

ealogar commented May 12, 2021

Sure,
github.com/Kong/go-pdk v0.6.1
the same in 2.4, 2.4.1 and 2.3.3
with go.mod and go.sum, as you taught me so well :D...

@javierguerragiraldez
Contributor

javierguerragiraldez commented May 12, 2021

Between the time the pod starts and the moment KONG_DB_CACHE_TTL expires, only sequence 0 is used

Note that the "sequence number" is a different concept from the "instance id". The first is used to detect changes or reloads from the database, while the second is used to synchronize between Kong and the Go plugin server.

Either when there's a database event (like a modification from the admin API), or after KONG_DB_CACHE_TTL, the __seq__ changes and the instance must be refreshed; but there are many other reasons why the instance can be closed and recreated, incrementing the instance id without changing the __seq__, for example the plugin server could close idle instances after some time (default 60 sec) without traffic.
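
Roughly, the per-worker bookkeeping behaves like the sketch below; this is simplified and the helper names (get_instance_id, spawn_instance) are illustrative, not the exact Kong internals:

  -- running_instances[key] caches the id handed out by the Go plugin server,
  -- together with the __seq__ of the config that created that instance
  local function get_instance_id(running_instances, key, conf, spawn_instance)
    local instance = running_instances[key]
    if instance and instance.seq == conf.__seq__ then
      return instance.id     -- same config generation: reuse the running instance
    end
    -- config was reloaded (cache TTL, admin API change) or never instantiated:
    -- ask the plugin server for a fresh instance and remember its id and seq
    local new_id = spawn_instance(conf)
    running_instances[key] = { id = new_id, seq = conf.__seq__ }
    return new_id
  end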

I'm re-reviewing the code to ensure they're not mixed up somewhere.

I see that in your latest log there are no id or Config (capital 'C') fields, meaning that it hasn't registered the running instance from Go yet. It was probably captured just when asking the server to start an instance.

What happens if you touch the config via the admin API? Would it fail too?
Does your plugin receive traffic? Could you try to verify the idle/reload process? For example, do one request to trigger the plugin, check the instance id, wait two minutes, then do a new request (hopefully hitting the same nginx worker) and see if it starts a new instance.

@ealogar
Contributor Author

ealogar commented May 12, 2021

I will test what you suggest later, but also keep in mind that the issue happens with 2.4.1, and there it always says that the instance id is 0...

@ealogar
Contributor Author

ealogar commented May 12, 2021

Look at the code at line 349 (in Kong 2.3.3), @javierguerragiraldez:

 if string.match(err, "No plugin instance") then

and the error message is

init.lua:345 [jwe] no plugin instance 902,

There is a capital 'N' in the pattern that prevents the following code from ever running...
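
You can see the mismatch from any Lua prompt; the error text below is taken from the log above, and this is just an illustration, not Kong code:

  local err = "no plugin instance 902"             -- what Kong actually logs
  print(string.match(err, "No plugin instance"))   -- nil: Lua patterns are case sensitive
  print(string.match(err, "no plugin instance"))   -- matches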

@javierguerragiraldez javierguerragiraldez added the task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. label May 12, 2021
@javierguerragiraldez
Contributor

🤦

I was sure it was a fragile test, but there wasn't any "clean" way to do it... Note that in 2.4 this check appears in both mp_rpc.lua and pb_rpc.lua.

Can you try with if string.match(err:lower(), "no plugin instance") then to see if that fixes the issue? If so, a PR would be great!
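
Something along these lines; the surrounding retry branch is simplified and the helper names (call_pluginserver, reset_instance, handle_event) are illustrative, not the exact mp_rpc.lua / pb_rpc.lua code:

  local ok, err = call_pluginserver(instance_id, event_name, args)
  if not ok and err and string.match(err:lower(), "no plugin instance") then
    -- the server lost this instance (restart, idle timeout, config reload):
    -- drop the stale id and retry the event once with a fresh instance
    reset_instance(plugin_name, conf)
    return handle_event(plugin_name, conf, event_name, args)
  end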

@ealogar
Contributor Author

ealogar commented May 12, 2021

Give me a few hours to do the tests; I cannot do them right now. I will let you know ASAP whether that fixes it, and I will open a PR as well.
After a first test I think it is fixing the issue... but I will confirm later. It's clear that the code that retries the event was never being executed (in Kong 2.4.1 as well).

@ealogar
Contributor Author

ealogar commented May 12, 2021

After a couple of hours... the fix has worked!!!
A couple of hours with 8 restarts of the balancer to refresh the cache and 0 failures in my load tests with 50 threads.

Currently I am making the change for 2.4.1 (I have to change 3 files) and will leave the load tests running all night. Tomorrow I will open a PR against master.
cc @javierguerragiraldez

@javierguerragiraldez javierguerragiraldez added task/bug and removed task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. labels May 12, 2021
@javierguerragiraldez
Contributor

Great news! Thanks for your work, and sorry for the blunder.
