caller=frontend_processor.go:59 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = Unavailable desc = transport is closing" #3606

Closed

lzh-lab opened this issue Dec 15, 2020 · 3 comments

lzh-lab (Contributor) commented Dec 15, 2020

Describe the bug
A while after Cortex boots, it logs the following error:

level=error ts=2020-12-15T09:20:44.990336271Z caller=frontend_processor.go:59 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = Unavailable desc = transport is closing"

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex (SHA or version)
  2. Wait 2 minutes

Environment:

  • Infrastructure: CentOS 7.4 amd64
  • Deployment tool: vscode debug

Storage Engine

  • [x] Blocks
  • [ ] Chunks

Additional Context

Debug config:

# Configuration for running Cortex in single-process mode.
# This should not be used in production.  It is only for getting started
# and development.

# Disable the requirement that every request to Cortex has a
# X-Scope-OrgID header. `fake` will be substituted in instead.
auth_enabled: false

server:
  http_listen_port: 9009
  # log_level: debug

  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    # The address to advertise for this ingester.  Will be autodiscovered by
    # looking up address on eth0 or en0; can be specified if this fails.
    # address: 127.0.0.1

    interface_names: [eth1]

    # We want to start immediately and flush on shutdown.
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  # You can choose between local storage and Amazon S3, Google GCS and Azure storage. Each option requires additional configuration
  # as shown below. All options can be configured via flags as well which might be handy for secret inputs.
  backend: s3 # s3, gcs, azure or filesystem are valid options
  s3:
    bucket_name: cortex
    endpoint: 127.0.0.1:9000
    # Configure your S3 credentials below.
    secret_access_key: "minioadmin"
    access_key_id: "minioadmin"
    insecure: true
#  gcs:
#    bucket_name: cortex
#    service_account: # if empty or omitted Cortex will use your default service account as per Google's fallback logic
#  azure:
#    account_name:
#    account_key:
#    container_name:
#    endpoint_suffix:
#    max_retries: # Number of retries for recoverable errors (defaults to 20)
# filesystem:
#   dir: ./data/tsdb

compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  storage:
    type: s3
    s3:
      s3: http://minioadmin:[email protected]:9000/configs
      s3forcepathstyle: true

Full log:

2020-12-15T17:19:16+08:00 debug layer=debugger halting
2020-12-15T17:19:16+08:00 warning layer=debugger reading debug_info: concrete subprogram without address range at 0x129f
Failed to get state - Process 9754 has exited with status 0
dumpStacktrace - Failed to get debugger state Process 9754 has exited with status 0
Failed to get threads - Process 9754 has exited with status 0
dumpStacktrace: Failed to produce stack traceProcess 9754 has exited with status 0
dumpStacktrace - Failed to get debugger state Process 9754 has exited with status 0
dumpStacktrace: Failed to produce stack traceProcess 9754 has exited with status 0
Process exiting with code: 0
API server listening at: 127.0.0.1:2232
2020-12-15T17:19:23+08:00 info layer=debugger launching process with args: [/data/cortex/cmd/cortex/__debug_bin -config.file=/data/cortex/docs/configuration/single-process-config-blocks-test.yaml]
2020-12-15T17:19:24+08:00 info layer=debugger created breakpoint: &api.Breakpoint{ID:1, Name:"", Addr:0x2a4264e, Addrs:[]uint64{0x2a4264e}, File:"/data/cortex/pkg/cortex/modules.go", Line:683, FunctionName:"github.com/cortexproject/cortex/pkg/cortex.(*Cortex).initCompactor", Cond:"", Tracepoint:false, TraceReturn:false, Goroutine:false, Stacktrace:0, Variables:[]string(nil), LoadArgs:(*api.LoadConfig)(0xc003e2d290), LoadLocals:(*api.LoadConfig)(0xc003e2d2c0), HitCount:map[string]uint64{}, TotalHitCount:0x0}
2020-12-15T17:19:24+08:00 debug layer=debugger continuing
level=info ts=2020-12-15T09:19:24.966769368Z caller=main.go:187 msg="Starting Cortex" version="(version=, branch=, revision=)"
level=info ts=2020-12-15T09:19:24.967194308Z caller=server.go:229 http=[::]:9009 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-12-15T09:19:24.971032718Z caller=inmemory.go:172 msg="created in-memory index cache" maxItemSizeBytes=134217728 maxSizeBytes=1073741824 maxItems=maxInt
level=info ts=2020-12-15T09:19:24.975623793Z caller=mapper.go:38 msg="cleaning up mapped rules directory" path=/rules
level=warn ts=2020-12-15T09:19:24.977346882Z caller=modules.go:293 msg="Worker address is empty in single binary mode.  Attempting automatic worker configuration.  If queries are unresponsive consider configuring the worker explicitly." address=127.0.0.1:9095
level=info ts=2020-12-15T09:19:24.977397314Z caller=worker.go:109 msg="Starting querier worker connected to query-frontend" frontend=127.0.0.1:9095
level=info ts=2020-12-15T09:19:24.977699904Z caller=module_service.go:58 msg=initialising module=server
level=info ts=2020-12-15T09:19:24.979450369Z caller=module_service.go:58 msg=initialising module=memberlist-kv
level=info ts=2020-12-15T09:19:24.979838292Z caller=module_service.go:58 msg=initialising module=store-gateway
level=info ts=2020-12-15T09:19:24.980056534Z caller=bucket_stores.go:123 msg="synchronizing TSDB blocks for all users"
level=info ts=2020-12-15T09:19:24.979510316Z caller=module_service.go:58 msg=initialising module=query-frontend-tripperware
level=info ts=2020-12-15T09:19:24.981582444Z caller=module_service.go:58 msg=initialising module=ingester-service
level=info ts=2020-12-15T09:19:24.981667825Z caller=module_service.go:58 msg=initialising module=ring
level=info ts=2020-12-15T09:19:24.982095175Z caller=module_service.go:58 msg=initialising module=distributor-service
level=info ts=2020-12-15T09:19:24.981913071Z caller=ingester_v2.go:1328 msg="opening existing TSDBs"
level=info ts=2020-12-15T09:19:24.981724703Z caller=module_service.go:58 msg=initialising module=store-queryable
level=info ts=2020-12-15T09:19:24.981800966Z caller=module_service.go:58 msg=initialising module=query-frontend
level=info ts=2020-12-15T09:19:24.982165097Z caller=client.go:242 msg="value is nil" key=collectors/ring index=1
level=info ts=2020-12-15T09:19:24.984306094Z caller=bucket_stores.go:132 msg="successfully synchronized TSDB blocks for all users"
level=info ts=2020-12-15T09:19:24.986142024Z caller=module_service.go:58 msg=initialising module=querier
level=info ts=2020-12-15T09:19:24.986220946Z caller=module_service.go:58 msg=initialising module=ruler
level=info ts=2020-12-15T09:19:24.986252642Z caller=worker.go:189 msg="adding connection" addr=127.0.0.1:9095
level=info ts=2020-12-15T09:19:24.986288402Z caller=ruler.go:404 msg="ruler up and running"
level=warn ts=2020-12-15T09:19:24.987986927Z caller=logging.go:71 traceID=5dd9dee4ad9e893b msg="POST /api/prom/push (500) 2.30105ms Response: \"DoBatch: IngesterCount <= 0\\n\" ws: false; Content-Encoding: snappy; Content-Length: 8187; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.23.0; X-Prometheus-Remote-Write-Version: 0.1.0; "
level=info ts=2020-12-15T09:19:24.991855266Z caller=head.go:645 org_id=fake msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2020-12-15T09:19:24.991914599Z caller=head.go:659 org_id=fake msg="On-disk memory mappable chunks replay completed" duration=10.287µs
level=info ts=2020-12-15T09:19:24.991934562Z caller=head.go:665 org_id=fake msg="Replaying WAL, this may take a while"
level=info ts=2020-12-15T09:19:24.996487985Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=0 maxSegment=9
level=info ts=2020-12-15T09:19:25.002706885Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=1 maxSegment=9
level=info ts=2020-12-15T09:19:25.005165025Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=2 maxSegment=9
level=info ts=2020-12-15T09:19:25.006035337Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=3 maxSegment=9
level=info ts=2020-12-15T09:19:25.011190199Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=4 maxSegment=9
level=info ts=2020-12-15T09:19:25.012160626Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=5 maxSegment=9
level=info ts=2020-12-15T09:19:25.044127191Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=6 maxSegment=9
level=info ts=2020-12-15T09:19:25.04509218Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=7 maxSegment=9
level=info ts=2020-12-15T09:19:25.063447115Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=8 maxSegment=9
level=info ts=2020-12-15T09:19:25.063870344Z caller=head.go:717 org_id=fake msg="WAL segment loaded" segment=9 maxSegment=9
level=info ts=2020-12-15T09:19:25.063901799Z caller=head.go:722 org_id=fake msg="WAL replay completed" checkpoint_replay_duration=59.602µs wal_replay_duration=71.88942ms total_replay_duration=71.995312ms
level=info ts=2020-12-15T09:19:25.066977328Z caller=db.go:1363 org_id=fake msg="Compactions disabled"
level=info ts=2020-12-15T09:19:25.067024409Z caller=ingester_v2.go:1241 org_id=fake msg="Running compaction after WAL replay"
level=info ts=2020-12-15T09:19:25.067201365Z caller=ingester_v2.go:1421 msg="successfully opened existing TSDBs"
level=info ts=2020-12-15T09:19:25.067262395Z caller=lifecycler.go:521 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2020-12-15T09:19:25.067415288Z caller=cortex.go:379 msg="Cortex started"
level=info ts=2020-12-15T09:19:25.067438667Z caller=lifecycler.go:550 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-12-15T09:19:25.067607621Z caller=lifecycler.go:397 msg="auto-joining cluster after timeout" ring=ingester
level=error ts=2020-12-15T09:20:44.990136969Z caller=frontend_processor.go:59 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = Unavailable desc = transport is closing"
[... 19 more identical "transport is closing" errors at 09:20:44 omitted ...]
level=info ts=2020-12-15T09:24:33.723057284Z caller=gateway.go:309 msg="synchronizing TSDB blocks for all users" reason=periodic
level=info ts=2020-12-15T09:24:33.725040615Z caller=gateway.go:315 msg="successfully synchronized TSDB blocks for all users" reason=periodic
level=error ts=2020-12-15T09:24:41.367500077Z caller=frontend_processor.go:59 msg="error processing requests" address=127.0.0.1:9095 err="rpc error: code = Unavailable desc = transport is closing"
[... 19 more identical "transport is closing" errors at 09:24:41 omitted ...]

lzh-lab (Contributor, Author) commented Dec 15, 2020

@pstibrany Can you help me troubleshoot this problem?

pstibrany (Contributor) commented Dec 15, 2020

We have observed this as well. It is caused by the low gRPC keepalive settings introduced in #3431. It turns out those settings also need to be applied on the server side, otherwise the server will disconnect clients that are "pinging" it too often. (See #3431 (comment).)

In your case, the queriers are connected to the query-frontend. Most likely there aren't enough queries coming into the system, so the queriers start sending PING messages to the frontend instead (at the gRPC level). By default, a gRPC server disconnects a client that sends PINGs too frequently (more often than once every 5 minutes), and that is where the "transport is closing" error comes from.

This can be fixed by relaxing these limits on the server (in this case, the query-frontend) side, using the new options available from weaveworks/common#205 (Cortex PR #3524), specifically -server.grpc.keepalive.min-time-between-pings=10s and -server.grpc.keepalive.ping-without-stream-allowed=true.
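
In YAML form these would go in the server block of the config above. A minimal sketch, assuming the YAML keys mirror the CLI flag names introduced by #3524:

server:
  http_listen_port: 9009

  # Relax the gRPC keepalive enforcement policy so idle queriers may ping
  # as often as every 10s, even when they have no active streams. The gRPC
  # defaults (5m minimum interval, no pings without active streams) would
  # otherwise close the connection with "transport is closing".
  grpc_server_min_time_between_pings: 10s
  grpc_server_ping_without_stream_allowed: true

With these two settings the server stops disconnecting idle querier workers, at the cost of a little extra keepalive traffic.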

lzh-lab (Contributor, Author) commented Dec 15, 2020

(quotes @pstibrany's reply above)

It worked. Thanks.
