bug: health checker target leak #10636

Closed
ChuanFF opened this issue Dec 12, 2023 · 4 comments · Fixed by #10655

ChuanFF commented Dec 12, 2023

Current Behavior

When an upstream is updated or deleted, the Prometheus plugin's upstream health status metrics do not remove the nodes of that upstream.

[screenshot: upstream_status metrics still reporting nodes of the removed upstream]

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. Create an upstream and a route.
  2. Request the Prometheus metrics URI.
  3. Remove the upstream.
  4. Request the Prometheus metrics URI again; the removed upstream's info is still present.

Environment

  • APISIX version (run apisix version): 2.13.1
  • Operating system (run uname -a): Linux pekshcsitd54867 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V):nginx version: openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):{"boot_time":1702379993,"etcd_version":"3.5.0","id":"aa6db183-43e6-4f8d-a86d-6ed48e76dbaa","hostname":"pekshcsitd54867","version":"2.13.1"}
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):

Sn0rt commented Dec 13, 2023

Yep, this API exposes the health checker's internal status (including mostly_healthy).

I think the metric was introduced by this commit:
809ba09

....
    -- update upstream_status metrics
    local stats = control.get_health_checkers()
    for _, stat in ipairs(stats) do
        for _, node in ipairs(stat.nodes) do
            metrics.upstream_status:set((node.status == "healthy") and 1 or 0,
                gen_arr(stat.name, node.ip, node.port))
        end
    end

    core.response.set_header("content_type", "text/plain")
    return 200, core.table.concat(prometheus:metric_data())
end
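
For context on the leak: the loop above only ever calls set() for nodes returned by get_health_checkers(), so gauge series recorded earlier for upstreams or nodes that no longer exist are never cleared. A minimal illustrative sketch of one way to avoid that, assuming the nginx-lua-prometheus gauge exposes reset() (illustration only, not necessarily the approach taken in the actual fix):

    -- update upstream_status metrics
    -- hypothetical: clear every previously recorded upstream_status series
    -- first, so label sets of deleted upstreams/nodes do not linger
    metrics.upstream_status:reset()

    local stats = control.get_health_checkers()
    for _, stat in ipairs(stats) do
        for _, node in ipairs(stat.nodes) do
            metrics.upstream_status:set((node.status == "healthy") and 1 or 0,
                gen_arr(stat.name, node.ip, node.port))
        end
    end
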
        handler = _M.dump_plugin_metadata,
    },
    get_health_checkers = _get_health_checkers,
}
local function _get_health_checkers()
    local infos = {}
    local routes = get_routes()
    iter_and_add_healthcheck_info(infos, routes)
    local services = get_services()
    iter_and_add_healthcheck_info(infos, services)
    local upstreams = get_upstreams()
    iter_and_add_healthcheck_info(infos, upstreams)
    return infos
end
local function iter_and_add_healthcheck_info(infos, values)
    if not values then
        return
    end

    for _, value in core.config_util.iterate_values(values) do
        local checks = value.value.checks or (value.value.upstream and value.value.upstream.checks)
        if checks then
            local info = extra_checker_info(value)
...
local function extra_checker_info(value)
    if not healthcheck then
        healthcheck = require("resty.healthcheck")
    end

    local name = upstream_mod.get_healthchecker_name(value)
    local nodes, err = healthcheck.get_target_list(name, "upstream-healthcheck")
    if err then
        core.log.error("healthcheck.get_target_list failed: ", err)
    end
    return {
        name = value.key,
        nodes = nodes,
    }
end
function _M.get_target_list(name, shm_name)
  local self = {
    name = name,
    shm_name = shm_name,
    log = checker.log,
  }
  self.shm = ngx.shared[tostring(shm_name)]
  assert(self.shm, ("no shm found by name '%s'"):format(shm_name))
  self.TARGET_STATE     = SHM_PREFIX .. self.name .. ":state"
  self.TARGET_COUNTER   = SHM_PREFIX .. self.name .. ":counter"
  self.TARGET_LIST      = SHM_PREFIX .. self.name .. ":target_list"
  self.TARGET_LIST_LOCK = SHM_PREFIX .. self.name .. ":target_list_lock"
  self.LOG_PREFIX       = LOG_PREFIX .. "(" .. self.name .. ") "

  local ok, err = locking_target_list(self, function(target_list)
    self.targets = target_list
    for _, target in ipairs(self.targets) do
      local state_key = key_for(self.TARGET_STATE, target.ip, target.port, target.hostname)
      target.status = INTERNAL_STATES[self.shm:get(state_key)]
      if not target.hostheader then
        target.hostheader = nil
      end
    end

    return true
  end)
....
  return self.targets
end
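
A hypothetical probe (not from the APISIX codebase; it just reuses the get_target_list shown above with the same "upstream-healthcheck" shm that extra_checker_info passes) to dump what is still recorded for a given checker name after the upstream has been removed:

  local healthcheck = require("resty.healthcheck")

  -- checker_name would be whatever upstream_mod.get_healthchecker_name(value)
  -- returned for the upstream under test
  local function dump_tracked_targets(checker_name)
    local targets, err = healthcheck.get_target_list(checker_name,
                                                     "upstream-healthcheck")
    if err then
      ngx.log(ngx.ERR, "get_target_list failed: ", err)
      return
    end

    for _, t in ipairs(targets or {}) do
      ngx.log(ngx.WARN, "still tracked: ", t.ip, ":", t.port,
              " status=", tostring(t.status))
    end
  end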

moonming moved this to 👀 In review in Apache APISIX backlog Dec 14, 2023

Sn0rt commented Dec 14, 2023

I have added a new test case for question 2.

After testing, I found that the target handling of this health checker has some bugs:

  1. After the health check has run, if I remove a node from the route's upstream, the metric for the removed node is not removed.
  2. After the health check has run, if I add a node to the route's upstream, a metric for the newly added node is not added.
  3. After the health check has run, if I delete the entire route, the nodes tracked for that route are not deleted.

For the implementation, you can refer to the test cases included in the linked PR. It has not been fixed yet because the etcd-related implementation logic is relatively complex; I am still evaluating the design.
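
A rough sketch of the kind of reconciliation that seems to be missing when an upstream's node list changes, assuming lua-resty-healthcheck's add_target/remove_target API (illustration only, not the design being evaluated):

    -- hypothetical helper: make the checker's target set follow the
    -- upstream's current node list, so removed nodes stop being tracked
    -- (nodes are assumed to be { host = ..., port = ... } records)
    local function sync_checker_targets(checker, old_nodes, new_nodes)
        local wanted = {}
        for _, node in ipairs(new_nodes) do
            wanted[node.host .. ":" .. node.port] = true
        end

        -- drop targets whose node no longer exists in the upstream
        for _, node in ipairs(old_nodes) do
            if not wanted[node.host .. ":" .. node.port] then
                checker:remove_target(node.host, node.port)
            end
        end

        -- register any newly added nodes (assumed: the library rejects
        -- duplicates of targets it already tracks)
        for _, node in ipairs(new_nodes) do
            checker:add_target(node.host, node.port, nil, true)
        end
    end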


Sn0rt commented Dec 18, 2023

Duplicate of #10500.

EDIT: the above-mentioned issue is not a duplicate.

monkeyDluffy6017 changed the title from "bug: Prometheus plugin issues with removed upstream" to "bug: health checker target leak" Dec 18, 2023
github-project-automation bot moved this from 👀 In review to ✅ Done in Apache APISIX backlog Dec 25, 2023

shreemaan-abhishek commented Mar 14, 2024

The PR that fixed this bug has been reverted as it significantly increased CPU load. Another way to fix this bug would be to configure metrics expiry:

# expire: 0 # The expiration time after metrics become inactive, unit: second.
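
For reference, a hedged example of what that could look like in conf/config.yaml, assuming the option sits under plugin_attr.prometheus as the quoted default suggests (the value here is arbitrary):

  plugin_attr:
    prometheus:
      # expire series that have received no updates for this many seconds,
      # so stale upstream_status entries eventually disappear (0 = keep forever)
      expire: 3600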
