
"no plugin instance" error appears when renewing the cache (KONG_DB_CACHE_TTL parameter) with a custom Go plugin #7148

Closed
ealogar opened this issue May 12, 2021 · 11 comments · Fixed by #7153

Comments

@ealogar
Contributor

ealogar commented May 12, 2021

Summary

Recently I have been trying to upgrade from Kong 2.0.4 to Kong 2.4 (and later Kong 2.3.3), and I ran into a very weird behaviour in a custom Go plugin (a global plugin, called jwe, not linked directly to any service or route). I have also migrated the plugin to the new standalone plugin server that communicates with Kong directly over protobuf, without go-pluginserver.
Everything was done successfully.
This plugin decrypts an Authorization header containing a JWE (JSON Web Encryption token) and propagates the result to the rest of the plugin chain (basically your jwt plugin and headers-transform). It is configured as the first plugin in the chain.

When I run a JMeter test with a single thread, everything works fine for several hours; that is, the global plugin is executed correctly and the bearer token is decrypted and propagated to the jwt plugin.

But the problem arises some time after I run a JMeter test with several threads and a bit more load (50-100 threads).

We deploy the Kong Docker image as a k8s pod deployment.

When the pod is started, everything works perfectly (at 500-800 tps), but after some time (a deterministic time, as I will explain later) some errors begin to appear.

The following errors appear in 2.3.3:

2021/05/12 13:54:53 [error] 27#0: *253919 [kong] init.lua:332 [jwe] no plugin instance 123499, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"
2021/05/12 13:54:53 [error] 26#0: *253921 [kong] init.lua:332 [jwe] no plugin instance 123514, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

In kong 2.4, the instance id is always 0:

2021/05/12 11:52:21 [error] 28#0: *138170 [kong] mp_rpc.lua:308 [jwe] no plugin instance 0, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"
2021/05/12 11:52:21 [error] 28#0: *138723 [kong] mp_rpc.lua:308 [jwe] no plugin instance 0, client: 10.240.2.28, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

The deterministic time at which this begins to happen is exactly KONG_DB_CACHE_TTL. Initially I configured 1 hour, but after a day of seeing this behaviour my suspicion led me to change it to 30 minutes, then 10 minutes, and repeat the test.
When the pod starts, everything works well until KONG_DB_CACHE_TTL has passed; then (I know you refresh services, routes and plugins in the cache to avoid hitting the database) the errors above begin to happen for a percentage of the requests (not all of them fail, but around 10%).

I imagine that when you rebuild the services/routes in the cached router, something related to the external plugin instances gets corrupted, and the bridge used to communicate with the external plugins returns invalid values. I keep looking in the source code, but so far I don't have more details.

If I go to the pod and kill the plugin process directly, it is respawned correctly, but the same error keeps happening.
Only a restart of the Kong process seems to fix it.

This causes around 10% of the requests to skip the global plugin and end in failure: they are correctly handed off to the Lua plugins (jwt, for example), but they end with an error because the token was not decrypted.
The weird thing is that it does not happen for all the requests; many of them are processed correctly by the plugin, and if you reduce the load (1 thread in JMeter) all requests work.
I have tried killing the Go server (the one running the plugin code); Kong quickly spawns a new process, but the error persists.

There may be a slight difference between preloading the services, routes and plugins during the worker warmup and refreshing them later 🗡️

Steps To Reproduce

  1. Run the Kong Docker image with version 2.4, 2.4.1 or 2.3.3 and configure KONG_DB_CACHE_TTL to a deterministic value (1 hour, 30 minutes, 10 minutes...)
  2. Provision 300-400 services and routes
  3. Configure a Go plugin globally (without go-pluginserver) and the jwt and request-transformer plugins for each service.
  4. Run JMeter with 50 threads / 1 thread.

Additional Details & Logs

It may be related to #6500. @javierguerragiraldez, if you need additional details just tell me.

I am now testing a simple go-hello.go plugin to check whether I can reproduce the problem and rule out a failure in the jwe plugin. Since everything works well before the cache reload happens, I suspect I will get the same result, but I wanted to let you know in advance.

@ealogar
Contributor Author

ealogar commented May 12, 2021

Update: I have modified runloop/plugin_servers/init.lua to print out running_instances, and I got the following interesting log:

2021/05/12 16:10:35 [error] 26#0: *85638 [kong] init.lua:333 [jwe] no plugin instance 8996, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "api.default-ci.eag.baikalplatform.es"
2021/05/12 16:10:35 [notice] 26#0: *85638 [kong] init.lua:334 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    Config = {
      DBHost = "pgbouncer",
      DBName = "apigw",
      DBPassword = "zzzz",
      DBPort = 5432,
      DBUser = "apigw_eag_",
      PostBasePathRegexp = "/token-introspection/v[0-9]+",
      PostParam = "token"
    },
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 70752,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "zzzz",
      dbport = 5432,
      dbuser = "apigw_eag_",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    id = 8997,
    rpc = {
      msg_id = 248479,
      notifications_callbacks = {
        serverPid = <function 1>
      },
      pluginserver_pid = 32,
      socket_path = "/usr/local/kong/jwe.socket",
      <metatable> = <1>{
        __index = <table 1>,
        call = <function 2>,
        fix_mmap = <function 3>,
        new = <function 4>,
        notification = <function 5>,
        notifications_callbacks = {}
      }
    },
    seq = 70752
  }
}, client: 10.240.1.3, server: kong, request: "POST /mockbin/v1/echo HTTP/1.1", host: "example.es"

The instance id it is trying to find (8996) is exactly one less than the one currently stored in memory (8997)... The same behaviour happens for the other requests that fail.

Maybe this can lead to something :D
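
For reference, the extra logging I describe above was roughly of this shape; this is only a sketch, not the actual diff, and the use of Kong's bundled inspect module to serialize the table is my own choice:

  local inspect = require "inspect"

  -- dump the missing id and the whole per-worker table right where the
  -- "no plugin instance" error is raised, then let the normal error path run
  kong.log.err("no plugin instance ", instance_id)
  kong.log.notice(inspect(running_instances))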

@ealogar
Contributor Author

ealogar commented May 12, 2021

Between the time the pod starts and the moment KONG_DB_CACHE_TTL expires, only sequence 0 is used. I have added logs every time it is used and I see only one log entry per worker:

2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:83 [jwe] Adding a new instance as we are the first, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"
2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:84 [jwe] {}, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"
2021/05/12 16:31:26 [notice] 26#0: *325 [kong] init.lua:86 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 0,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "xxxx",
      dbport = 5432,
      dbuser = "apigw_eag",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    seq = 0
  }
}, client: 10.240.0.228, server: kong, request: "GET /jwe/healthz HTTP/1.1", host: "10.240.1.40:8000"

10.240.0.228 - - [12/May/2021:16:31:26 +0000] "GET /jwe/healthz HTTP/1.1" 200 14 "-" "kube-probe/1.20" "-" "-"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:83 [jwe] Adding a new instance as we are the first, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:84 [jwe] {}, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"
2021/05/12 16:31:33 [notice] 27#0: *379 [kong] init.lua:86 [jwe] {
  ["plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d"] = {
    conf = {
      __key__ = "plugins:jwe:::::65751388-d870-4bcb-8487-e830b820d96d",
      __seq__ = 0,
      dbhost = "pgbouncer",
      dbname = "apigw",
      dbpassword = "xxx",
      dbport = 5432,
      dbuser = "apigw_eag",
      postbasepathregexp = "/token-introspection/v[0-9]+",
      postparam = "token"
    },
    seq = 0
  }
}, client: 10.240.2.28, server: kong, request: "POST /token HTTP/1.1", host: "example.es"

@javierguerragiraldez
Contributor

Hi @ealogar, thanks for the detailed report. Can you tell us which version of go-pdk you are using? Is it the same on the different Kong versions?

@ealogar
Contributor Author

ealogar commented May 12, 2021

Sure,
github.com/Kong/go-pdk v0.6.1
the same in 2.4, 2.4.1 and 2.3.3
with go.mod and go.sum, as you taught me so well :D...

@javierguerragiraldez
Contributor

javierguerragiraldez commented May 12, 2021

Between the time the pod starts and the moment KONG_DB_CACHE_TTL expires, only sequence 0 is used

Note that the "sequence number" is a different concept from the "instance id". The first is used to detect changes or reloads from the database, while the second is used to synchronize between Kong and the Go plugin server.

Either when there's a database event (like a modification from the admin API), or after KONG_DB_CACHE_TTL, the __seq__ changes and the instance must be refreshed; but there are many other reasons why the instance can be closed and recreated, incrementing the instance id without changing the __seq__, for example the plugin server could close idle instances after some time (default 60 sec) without traffic.
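
Roughly, the per-worker bookkeeping behaves like the sketch below; this is simplified and the helper names (get_instance_id, spawn_instance) are illustrative, not the exact Kong internals:

  -- running_instances[key] caches the id handed out by the Go plugin server,
  -- together with the __seq__ of the config that created that instance
  local function get_instance_id(running_instances, key, conf, spawn_instance)
    local instance = running_instances[key]
    if instance and instance.seq == conf.__seq__ then
      return instance.id     -- same config generation: reuse the running instance
    end
    -- config was reloaded (cache TTL, admin API change) or never instantiated:
    -- ask the plugin server for a fresh instance and remember its id and seq
    local new_id = spawn_instance(conf)
    running_instances[key] = { id = new_id, seq = conf.__seq__ }
    return new_id
  end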

I'm re-reviewing the code to ensure they're not mixed up somewhere.

I see that in your latest log there are no id or Config (capital 'C') fields, meaning that it hasn't registered the running instance from Go yet. It was probably captured just when asking the server to start an instance.

What happens if you touch the config via the admin API? Would it fail too?
Does your plugin receive traffic? Could you try to verify the idle/reload process? For example, do one request to trigger the plugin, check the instance id, wait two minutes, then do a new request (hopefully hitting the same nginx worker) and see if it starts a new instance.

@ealogar
Contributor Author

ealogar commented May 12, 2021

I will test what you suggest later, but also keep in mind that the issue happens with 2.4.1, and there it always says that the instance id is 0...

@ealogar
Contributor Author

ealogar commented May 12, 2021

Look at the code at line 349 (in Kong 2.3.3), @javierguerragiraldez:

 if string.match(err, "No plugin instance") then

and the error message is

init.lua:345 [jwe] no plugin instance 902,

There is a capital 'N' in the pattern that prevents the following code from ever running...
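
You can see the mismatch from any Lua prompt; the error text below is taken from the log above, and this is just an illustration, not Kong code:

  local err = "no plugin instance 902"             -- what Kong actually logs
  print(string.match(err, "No plugin instance"))   -- nil: Lua patterns are case sensitive
  print(string.match(err, "no plugin instance"))   -- matches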

@javierguerragiraldez javierguerragiraldez added the task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. label May 12, 2021
@javierguerragiraldez
Contributor

🤦

I was sure it was a fragile test, but there wasn't any "clean" way to do it... Note that in 2.4 this check appears in both mp_rpc.lua and pb_rpc.lua.

Can you try with if string.match(err:lower(), "no plugin instance") then to see if that fixes the issue? If so, a PR would be great!
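
Something along these lines; the surrounding retry branch is simplified and the helper names (call_pluginserver, reset_instance, handle_event) are illustrative, not the exact mp_rpc.lua / pb_rpc.lua code:

  local ok, err = call_pluginserver(instance_id, event_name, args)
  if not ok and err and string.match(err:lower(), "no plugin instance") then
    -- the server lost this instance (restart, idle timeout, config reload):
    -- drop the stale id and retry the event once with a fresh instance
    reset_instance(plugin_name, conf)
    return handle_event(plugin_name, conf, event_name, args)
  end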

@ealogar
Contributor Author

ealogar commented May 12, 2021

Give me a few hours to do the tests; I cannot do them right now. I will let you know ASAP whether that fixes it, and I will open a PR as well.
After a first test I think it is fixing the issue... but I will confirm later. It's clear that the code that retries the event was never being executed (in Kong 2.4.1 as well).

@ealogar
Contributor Author

ealogar commented May 12, 2021

After a couple of hours... the fix has worked!!!
A couple of hours with 8 restarts of the balancer to refresh the cache and 0 failures in my load tests with 50 threads.

Currently I am making the change for 2.4.1 (I have to change 3 files) and will leave the load tests running all night. Tomorrow I will open a PR against master.
cc @javierguerragiraldez

@javierguerragiraldez javierguerragiraldez added task/bug and removed task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. labels May 12, 2021
@javierguerragiraldez
Contributor

Great news! Thanks for your work, and sorry for the blunder.
