Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blocking query of the health API "/health/service/<service>" may return empty nodes for non-empty services. #20790

Closed
mechpen opened this issue Mar 2, 2024 · 0 comments · Fixed by #20876

Comments

@mechpen
Copy link

mechpen commented Mar 2, 2024

Overview of the Issue

For example, a GET call of the API /health/service/consul?index=9999999&wait=10s may return status_code=200 index=1 nodes=[] when the following event 2 happens immediately after event 1:

  1. The consul token used in the request is revoked (for example, by the vault consul secret engine).
  2. 10 seconds deadline reached,

This is due to a race condition of handling errors caused by the above two events:

  1. Error handling of the first event is at here. In line 86, m.mat.reset() resets the index.
  2. Error handling of the second event is at here. In line 133, m.index=0 due to the above error handling. So the result.Value is set to an empty list.

Reproduction Steps

  1. Create a consul config file consul.json:
{
  "server": true,
  "bootstrap" : true,
  "datacenter": "dc1",
  "data_dir": "/tmp/consul",
  "client_addr": "127.0.0.1",
  "advertise_addr": "127.0.0.1",
  "acl": {
    "enabled": true,
    "default_policy": "deny"
  }
}
  1. Create a python script reproduce.py:
import os
import json
import time
import subprocess
from multiprocessing import Process

import requests

CONSUL_HTTP_ADDR = os.getenv("CONSUL_HTTP_ADDR")

def create_token():
    cmd = "consul acl token create -policy-name=global-management -format=json"
    out = subprocess.check_output(cmd.split())
    return json.loads(out)

def delete_token(token):
    cmd = f"consul acl token delete -accessor-id={token['AccessorID']}"
    subprocess.check_call(cmd.split())

def get_consul_nodes(token, wait):
    url = CONSUL_HTTP_ADDR + "/v1/health/service/consul"
    url += f"?index=99999999&wait={wait}"
    headers = {
        "x-consul-token": token["SecretID"],
    }

    ret = requests.get(url, headers=headers)
    if ret.status_code == 200:
        nodes = ret.json()
        if len(nodes) == 0:
            print(f"=== status_code={ret.status_code} index={ret.headers['X-Consul-Index']} nodes={nodes}")

def main():
    token = create_token()
    delays = range(10)
    blocking_wait_seconds = 10

    def f(d):
        time.sleep(d/len(delays))
        get_consul_nodes(token, f"{blocking_wait_seconds}s")

    procs = []
    for d in delays:
        p = Process(target=f, args=(d,))
        p.start()
        procs.append(p)

    time.sleep(blocking_wait_seconds)
    delete_token(token)

    for p in procs:
        p.join()
  1. Start consul server
$ consul agent -config-file consul.json
  1. Run the reproduce script:
$ consul acl bootstrap
...
SecretID:         xxx
Policies:
   00000000-0000-0000-0000-000000000001 - global-management
...
$ export CONSUL_HTTP_ADDR=http://127.0.0.1:8500
$ export CONSUL_HTTP_TOKEN=xxx
$ python reproduce.py
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]
=== status_code=200 index=1 nodes=[]

Consul info

Reproduced in consul 1.11.7. Also reproduced in the main branch commit: 6f48ce1

Operating system and Environment details

Arch Linux Kernel v6.6.6

Log Fragments

2024-03-01T15:37:06.262-0800 [ERROR] agent.rpcclient.health: subscribe call failed: err="stream reset requested" failure_count=1 key=consul m=0xc001761ae0 topic=ServiceHealth
2024-03-01T15:37:12.451-0800 [ERROR] agent.rpcclient.health: subscribe call failed: err="rpc error: code = Unknown desc = ACL not found" failure_count=2 key=consul m=0xc001761ae0 topic=ServiceHealth
ishustava added a commit that referenced this issue Mar 18, 2024
Currently, when a client starts a blocking query and an ACL token expires within
that time, Consul will return ACL not found error with a 403 status code. However,
sometimes if an ACL token is invalidated at the same time as the query's deadline is reached,
Consul will instead return an empty response with a 200 status code.

This is because of the events being executed.
1. Client issues a blocking query request with timeout `t`.
2. ACL is deleted.
3. Server detects a change in ACLs and force closes the gRPC stream.
4. Client resubscribes with the same token and resets its state (view).
5. Client sees "ACL not found" error.

If ACL is deleted before step 4, the client is unaware that the stream was closed due to
an ACL error and will return an empty view (from the reset state) with the 200 status code.

To fix this problem, we introduce another state to the subsciption to indicate when a change
to ACLs has occured. If the server sees that there was an error due to ACL change, it will
re-authenticate the request and return an error if the token is no longer valid.

Fixes #20790
ishustava added a commit that referenced this issue Mar 22, 2024
#20876)

Currently, when a client starts a blocking query and an ACL token expires within
that time, Consul will return ACL not found error with a 403 status code. However,
sometimes if an ACL token is invalidated at the same time as the query's deadline is reached,
Consul will instead return an empty response with a 200 status code.

This is because of the events being executed.
1. Client issues a blocking query request with timeout `t`.
2. ACL is deleted.
3. Server detects a change in ACLs and force closes the gRPC stream.
4. Client resubscribes with the same token and resets its state (view).
5. Client sees "ACL not found" error.

If ACL is deleted before step 4, the client is unaware that the stream was closed due to
an ACL error and will return an empty view (from the reset state) with the 200 status code.

To fix this problem, we introduce another state to the subsciption to indicate when a change
to ACLs has occured. If the server sees that there was an error due to ACL change, it will
re-authenticate the request and return an error if the token is no longer valid.

Fixes #20790
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant