bug: health checker target leak #10636

Closed
ChuanFF opened this issue Dec 12, 2023 · 4 comments · Fixed by #10655

ChuanFF commented Dec 12, 2023

Current Behavior

When an upstream is updated or deleted, the Prometheus plugin's upstream health status metrics do not remove the nodes of that upstream.

[screenshot: upstream_status metrics still reporting nodes of the removed upstream]

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. Create an upstream and a route.
  2. Request the Prometheus metrics URI.
  3. Remove the upstream.
  4. Request the Prometheus metrics URI again; the removed upstream's info is still present.

Environment

  • APISIX version (run apisix version): 2.13.1
  • Operating system (run uname -a): Linux pekshcsitd54867 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V):nginx version: openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):{"boot_time":1702379993,"etcd_version":"3.5.0","id":"aa6db183-43e6-4f8d-a86d-6ed48e76dbaa","hostname":"pekshcsitd54867","version":"2.13.1"}
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):

Sn0rt commented Dec 13, 2023

Yep, this API exposes the health checker's internal status (including mostly_healthy).

I think the metric was introduced by this commit:
809ba09

....
    -- update upstream_status metrics
    local stats = control.get_health_checkers()
    for _, stat in ipairs(stats) do
        for _, node in ipairs(stat.nodes) do
            metrics.upstream_status:set((node.status == "healthy") and 1 or 0,
                gen_arr(stat.name, node.ip, node.port))
        end
    end

    core.response.set_header("content_type", "text/plain")
    return 200, core.table.concat(prometheus:metric_data())
end
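
For context on the leak: the loop above only ever calls set() for nodes returned by get_health_checkers(), so gauge series recorded earlier for upstreams or nodes that no longer exist are never cleared. A minimal illustrative sketch of one way to avoid that, assuming the nginx-lua-prometheus gauge exposes reset() (illustration only, not necessarily the approach taken in the actual fix):

    -- update upstream_status metrics
    -- hypothetical: clear every previously recorded upstream_status series
    -- first, so label sets of deleted upstreams/nodes do not linger
    metrics.upstream_status:reset()

    local stats = control.get_health_checkers()
    for _, stat in ipairs(stats) do
        for _, node in ipairs(stat.nodes) do
            metrics.upstream_status:set((node.status == "healthy") and 1 or 0,
                gen_arr(stat.name, node.ip, node.port))
        end
    end
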
        handler = _M.dump_plugin_metadata,
    },
    get_health_checkers = _get_health_checkers,
}
local function _get_health_checkers()
    local infos = {}
    local routes = get_routes()
    iter_and_add_healthcheck_info(infos, routes)
    local services = get_services()
    iter_and_add_healthcheck_info(infos, services)
    local upstreams = get_upstreams()
    iter_and_add_healthcheck_info(infos, upstreams)
    return infos
end
local function iter_and_add_healthcheck_info(infos, values)
    if not values then
        return
    end

    for _, value in core.config_util.iterate_values(values) do
        local checks = value.value.checks or (value.value.upstream and value.value.upstream.checks)
        if checks then
            local info = extra_checker_info(value)
...
local function extra_checker_info(value)
    if not healthcheck then
        healthcheck = require("resty.healthcheck")
    end

    local name = upstream_mod.get_healthchecker_name(value)
    local nodes, err = healthcheck.get_target_list(name, "upstream-healthcheck")
    if err then
        core.log.error("healthcheck.get_target_list failed: ", err)
    end
    return {
        name = value.key,
        nodes = nodes,
    }
end
function _M.get_target_list(name, shm_name)
  local self = {
    name = name,
    shm_name = shm_name,
    log = checker.log,
  }
  self.shm = ngx.shared[tostring(shm_name)]
  assert(self.shm, ("no shm found by name '%s'"):format(shm_name))
  self.TARGET_STATE     = SHM_PREFIX .. self.name .. ":state"
  self.TARGET_COUNTER   = SHM_PREFIX .. self.name .. ":counter"
  self.TARGET_LIST      = SHM_PREFIX .. self.name .. ":target_list"
  self.TARGET_LIST_LOCK = SHM_PREFIX .. self.name .. ":target_list_lock"
  self.LOG_PREFIX       = LOG_PREFIX .. "(" .. self.name .. ") "

  local ok, err = locking_target_list(self, function(target_list)
    self.targets = target_list
    for _, target in ipairs(self.targets) do
      local state_key = key_for(self.TARGET_STATE, target.ip, target.port, target.hostname)
      target.status = INTERNAL_STATES[self.shm:get(state_key)]
      if not target.hostheader then
        target.hostheader = nil
      end
    end

    return true
  end)
....
  return self.targets
end
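
A hypothetical probe (not from the APISIX codebase; it just reuses the get_target_list shown above with the same "upstream-healthcheck" shm that extra_checker_info passes) to dump what is still recorded for a given checker name after the upstream has been removed:

  local healthcheck = require("resty.healthcheck")

  -- checker_name would be whatever upstream_mod.get_healthchecker_name(value)
  -- returned for the upstream under test
  local function dump_tracked_targets(checker_name)
    local targets, err = healthcheck.get_target_list(checker_name,
                                                     "upstream-healthcheck")
    if err then
      ngx.log(ngx.ERR, "get_target_list failed: ", err)
      return
    end

    for _, t in ipairs(targets or {}) do
      ngx.log(ngx.WARN, "still tracked: ", t.ip, ":", t.port,
              " status=", tostring(t.status))
    end
  end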

moonming moved this to 👀 In review in Apache APISIX backlog Dec 14, 2023

Sn0rt commented Dec 14, 2023

I have added a new test case for question 2.

After testing, I found that the target handling of this health checker has some bugs:

  1. After the health check has run, if I remove a node from the route's upstream, the metric for the removed node is not removed.
  2. After the health check has run, if I add a node to the route's upstream, a metric for the newly added node is not added.
  3. After the health check has run, if I delete the entire route, the nodes tracked for that route are not deleted.

For the implementation, you can refer to the test cases included in the linked PR. It has not been fixed yet because the etcd-related implementation logic is relatively complex; I am still evaluating the design.
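
A rough sketch of the kind of reconciliation that seems to be missing when an upstream's node list changes, assuming lua-resty-healthcheck's add_target/remove_target API (illustration only, not the design being evaluated):

    -- hypothetical helper: make the checker's target set follow the
    -- upstream's current node list, so removed nodes stop being tracked
    -- (nodes are assumed to be { host = ..., port = ... } records)
    local function sync_checker_targets(checker, old_nodes, new_nodes)
        local wanted = {}
        for _, node in ipairs(new_nodes) do
            wanted[node.host .. ":" .. node.port] = true
        end

        -- drop targets whose node no longer exists in the upstream
        for _, node in ipairs(old_nodes) do
            if not wanted[node.host .. ":" .. node.port] then
                checker:remove_target(node.host, node.port)
            end
        end

        -- register any newly added nodes (assumed: the library rejects
        -- duplicates of targets it already tracks)
        for _, node in ipairs(new_nodes) do
            checker:add_target(node.host, node.port, nil, true)
        end
    end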


Sn0rt commented Dec 18, 2023

Duplicate of #10500.

EDIT: the above-mentioned issue is not a duplicate.

monkeyDluffy6017 changed the title from "bug: Prometheus plugin issues with removed upstream" to "bug: health checker target leak" Dec 18, 2023
github-project-automation bot moved this from 👀 In review to ✅ Done in Apache APISIX backlog Dec 25, 2023

shreemaan-abhishek commented Mar 14, 2024

The PR that fixed this bug has been reverted as it significantly increased CPU load. Another way to fix this bug would be to configure metrics expiry:

# expire: 0 # The expiration time after metrics become inactive, unit: second.
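
For reference, a hedged example of what that could look like in conf/config.yaml, assuming the option sits under plugin_attr.prometheus as the quoted default suggests (the value here is arbitrary):

  plugin_attr:
    prometheus:
      # expire series that have received no updates for this many seconds,
      # so stale upstream_status entries eventually disappear (0 = keep forever)
      expire: 3600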
