
Unrecoverable OOMKill during stream/consumer recovery on startup [v2.10.20] #5929

Closed
jzhn opened this issue Sep 25, 2024 · 14 comments · Fixed by #5973
Labels
defect Suspected defect such as a bug or regression

Comments

@jzhn

jzhn commented Sep 25, 2024

Observed behavior

  • NATS pod enters CrashLoopBackOff state as it gets OOMKilled immediately after startup.

  • The debug log shows no useful information; the last lines mention that NATS is starting a restore for the stream.

2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.171931 [INF] Starting nats-server
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172002 [INF]   Version:  2.10.20
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172004 [INF]   Git:      [7140387]
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172006 [DBG]   Go build: go1.22.6
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172007 [INF]   Cluster:  kubernetes-nats
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172009 [INF]   Name:     kubernetes-nats-0
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172010 [INF]   Node:     0EXN8Tyd
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172011 [INF]   ID:       NBV5WBW3LR5DERLSSC4NZMNOVEJMM2I52FTO4M7SNZDW6ZAK4Y2PUPOH
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174222 [INF] profiling port: 65432
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174231 [INF] Using configuration file: /etc/nats-config/nats.conf
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174297 [DBG] Created system account: "$SYS"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174623 [INF] Starting http monitor on 0.0.0.0:8222
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174653 [INF] Starting JetStream
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174663 [DBG] JetStream creating dynamic configuration - 0 B memory, 280.00 GB disk
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174750 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174754 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174755 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174756 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174757 [INF] 
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174758 [INF]          https://docs.nats.io/jetstream
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174759 [INF] 
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174760 [INF] ---------------- JETSTREAM ----------------
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174761 [INF]   Max Memory:      0 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174763 [INF]   Max Storage:     280.00 GB
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174764 [INF]   Store Directory: "/data/jetstream"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174765 [INF] -------------------------------------------
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174833 [DBG]   Exports:
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174840 [DBG]      $JS.API.>
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174867 [DBG] Enabled JetStream for account "$G"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174874 [DBG]   Max Memory:      -1 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174876 [DBG]   Max Storage:     -1 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174893 [DBG] Recovering JetStream state for account "$G"
2024-09-25 06:17:26.175	[3242] 2024/09/25 13:17:26.175749 [INF]   Starting restore for stream '$G > natse2e#interest'
2024-09-25 06:17:37.324	[3242] 2024/09/25 13:17:37.324591 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.324	[3242] 2024/09/25 13:17:37.324908 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.325	[3242] 2024/09/25 13:17:37.325677 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.326	[3242] 2024/09/25 13:17:37.326397 [INF]   Restored 526,717,218 messages for stream '$G > natse2e#interest' in 11.151s
2024-09-25 06:17:37.326	[3242] 2024/09/25 13:17:37.326428 [INF]   Recovering 1 consumers for stream - '$G > natse2e#interest'
  • The memory profile shows that despite GOMEMLIMIT being set to 819MiB, the nats-server Go process occupied ~1GB of memory, reaching the container memory limit.
    • (*msgBlock).loadBlock and (*msgBlock).rebuildStateLocked each account for ~45% of total memory
(pprof) top10
Showing nodes accounting for 1049.32MB, 98.80% of 1062.01MB total
Dropped 25 nodes (cum <= 5.31MB)
Showing top 10 nodes out of 22
      flat  flat%   sum%        cum   cum%
  476.24MB 44.84% 44.84%   484.24MB 45.60%  github.com/nats-io/nats-server/v2/server.(*msgBlock).loadBlock
  469.94MB 44.25% 89.09%  1030.18MB 97.00%  github.com/nats-io/nats-server/v2/server.(*msgBlock).rebuildStateLocked
      76MB  7.16% 96.25%       76MB  7.16%  github.com/minio/highwayhash.(*digest).Sum

Expected behavior

NATS is able to recover streams and consumers on startup while staying within GOMEMLIMIT.

Server and client version

  • nats-server: nats:2.10.20-alpine
  • nats helm chart: 1.2.4

Host environment

  • Kubernetes statefulset deployment with 3 replicas
  • EBS-backed persistent volume
  • x86-64 CPUs (m7i.2xlarge)
  • container memory limit: 1Gi
  • GOMEMLIMIT: 819MiB (roughly 80% of container memory limit)

Steps to reproduce

  • In a clean, new NATS JetStream cluster, create a new interest-retention, file-based, 3-replica stream.
  • Create a single durable consumer attached to the stream.
  • Publish a large number (~500M) of tiny (~300-byte) messages to the stream while keeping all subscribers of the consumer disabled, so that a large backlog of messages accumulates in the store (a sketch of these steps follows this list).
  • Trigger a crash of the NATS process.
  • Observe that the crashed NATS server is never able to start again; it gets OOMKilled immediately after each restart.
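
For reference, here is a minimal sketch of the first three steps using the nats.go client (github.com/nats-io/nats.go). The server URL and the async-pending limit are placeholders; the stream and consumer names mirror the ones above. This is an illustration of the setup, not the exact tool we used.

package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; point this at the JetStream cluster under test.
	nc, err := nats.Connect("nats://kubernetes-nats:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream(nats.PublishAsyncMaxPending(4096))
	if err != nil {
		panic(err)
	}

	// Interest-retention, file-backed, 3-replica stream (matches the config shown later).
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:      "natse2e#interest",
		Subjects:  []string{"natse2e#interest"},
		Retention: nats.InterestPolicy,
		Storage:   nats.FileStorage,
		Replicas:  3,
	}); err != nil {
		panic(err)
	}

	// Durable pull consumer with no active subscribers, so messages accumulate.
	if _, err := js.AddConsumer("natse2e#interest", &nats.ConsumerConfig{
		Durable:       "natse2e-interest-consumer",
		AckPolicy:     nats.AckExplicitPolicy,
		FilterSubject: "natse2e#interest",
		MaxAckPending: 1000,
	}); err != nil {
		panic(err)
	}

	// Publish ~500M tiny (~300-byte) messages.
	payload := make([]byte, 300)
	for i := 0; i < 500_000_000; i++ {
		if _, err := js.PublishAsync("natse2e#interest", payload); err != nil {
			panic(err)
		}
		if i%100_000 == 0 {
			<-js.PublishAsyncComplete() // apply backpressure and log progress
			fmt.Println("published", i)
		}
	}
	<-js.PublishAsyncComplete()
}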

Notes

  • This issue was reproduced on NATS 2.10.18 and 2.10.20.
  • The only way to get NATS out of the crash loop (without losing messages) is to temporarily assign a higher memory limit to the NATS container.

pprof memory dump

Since the nats process gets killed almost immediately after startup, it took many attempts to capture a memory dump via curl -o mem.pprof http://localhost:65432/debug/pprof/allocs. Given that this dump accounts for pretty much the entire memory limit, it should represent the state just before the OOMKill.
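
For anyone who wants to do the same, this is roughly the approach, sketched as a small Go helper; the profiling port matches the startup log above, and the output file names are arbitrary.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	for i := 0; ; i++ {
		resp, err := http.Get("http://localhost:65432/debug/pprof/allocs")
		if err != nil {
			// Server not up yet, or already OOMKilled; retry shortly.
			time.Sleep(250 * time.Millisecond)
			continue
		}
		name := fmt.Sprintf("mem-%03d.pprof", i)
		out, err := os.Create(name)
		if err != nil {
			resp.Body.Close()
			panic(err)
		}
		_, copyErr := io.Copy(out, resp.Body)
		resp.Body.Close()
		out.Close()
		if copyErr == nil {
			fmt.Println("captured", name)
		}
		time.Sleep(time.Second)
	}
}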

mem.pprof.zip

[pprof allocation graph: pprof001]

@jzhn jzhn added the defect Suspected defect such as a bug or regression label Sep 25, 2024
@derekcollison
Member

Do you properly set GOMEMLIMIT?

@jzhn
Author

jzhn commented Sep 26, 2024

Hi @derekcollison, yes, we've tried setting GOMEMLIMIT to 50%–80% of the container memory limit, but nats-server still crashes on startup. Please take a look at the dump file; it was taken with GOMEMLIMIT=819MiB while the container memory limit was 1Gi. It looks like NATS tries to load some large message blocks into memory during recovery, and interestingly the memory consumption is concentrated in two functions, each taking roughly 45% of the memory limit, so hopefully we can identify an improvement there.

What's not clear to me is how to reproduce this crash condition that forces NATS to recover from such a large message block. My team is still trying to find a deterministic way.

@kozlovic
Member

I checked the profile against a 2.10.20 build using --inuse_space (as opposed to alloc_space).

Based on the errors in the banner and the profile, it looks like the server was trying to load some of the message blocks, but for some of them the load returned an error saying they were not found, which triggered a rebuild of the state.

The memory profile shows that the server is preparing a slice of ~470MB to read a file based on the size indicated by the file stats:

        .          .   5827:	var sz int
         .          .   5828:	if info, err := f.Stat(); err == nil {
         .          .   5829:		sz64 := info.Size()
         .          .   5830:		if int64(int(sz64)) == sz64 {
         .          .   5831:			sz = int(sz64)
         .          .   5832:		} else {
         .          .   5833:			return nil, errMsgBlkTooBig
         .          .   5834:		}
         .          .   5835:	}
         .          .   5836:
         .          .   5837:	if buf == nil {
         .          .   5838:		buf = getMsgBlockBuf(sz)
         .          .   5839:		if sz > cap(buf) {
         .          .   5840:			// We know we will make a new one so just recycle for now.
         .          .   5841:			recycleMsgBlockBuf(buf)
         .          .   5842:			buf = nil
         .          .   5843:		}
         .          .   5844:	}
         .          .   5845:
         .          .   5846:	if sz > cap(buf) {
  476.24MB   476.24MB   5847:		buf = make([]byte, sz)
         .          .   5848:	} else {
         .          .   5849:		buf = buf[:sz]
         .          .   5850:	}

This is really suspicious since by default message blocks are rather small (around 5MB). Could you check the /data/jetstream/$G/streams/<stream name>/msgs directory (based on your startup banner) and list all files present? Do you see anything strange?
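
If it helps, something like this sketch would list any .blk files above the expected block size; the path follows your startup banner and the 8MiB threshold is an assumption, so adjust both as needed.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed store path (from the startup banner) and an assumed 8MiB threshold.
	const maxExpected = 8 * 1024 * 1024
	dir := "/data/jetstream/$G/streams/natse2e#interest/msgs"

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.IsDir() || !strings.HasSuffix(e.Name(), ".blk") {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.Size() > maxExpected {
			fmt.Printf("%s: %d bytes\n", filepath.Join(dir, e.Name()), info.Size())
		}
	}
}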

@neilalexander
Member

The 476.24MB on the inuse_space profile means that there are 476MB total heap allocations originating from that line, not that it's a single allocation of that size.

We still shouldn't be hitting that make() line in ideal circumstances though, so I'm still curious if there are any files in the msgs directory that are larger than 8388608 bytes.

I wonder if we are loading in blocks more quickly than they are being expired in this case, or if we should be recycling the blocks more aggressively but aren't.

@kozlovic
Member

The 476.24MB on the inuse_space profile means that there are 476MB total heap allocations originating from that line, not that it's a single allocation of that size.

I tried filling a server with millions of 300-byte messages and removed the index.db to cause a full restore. When running inuse_space against my server, it reported only ~5MB for this slice. I guess what you are saying is that the buffer should have been reused so it would stay that low?

@jzhn
Author

jzhn commented Sep 27, 2024

Could you check the /data/jetstream/$G/streams/<stream name>/msgs directory (based on your startup banner) and list all files present? Do you see anything strange?

Sorry, we have cleared the persistent storage to fix the cluster so that we could continue with other parts of our testing. We are trying to reproduce the issue again so we can provide more information, and will post it here ASAP.

@kozlovic
Member

@neilalexander But those buffers are put back into a sync.Pool, so we don't really have control over how many there are or how long they stay in the pool. Maybe we need to limit the size of a slice that can be put back into the pool?
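
Roughly what I have in mind, as a sketch only; the 16MiB cap and the bodies of these helpers are hypothetical and not the server's actual implementation (only the helper names come from the code excerpt above).

package main

import (
	"fmt"
	"sync"
)

// Hypothetical cap on what we allow back into the pool.
const maxPooledBufSize = 16 * 1024 * 1024

var blkPool = sync.Pool{
	New: func() any { return []byte(nil) },
}

func getMsgBlockBuf(sz int) []byte {
	buf := blkPool.Get().([]byte)
	if cap(buf) >= sz {
		return buf[:sz]
	}
	// Too small; caller will allocate a fresh slice of the right size.
	return buf[:0]
}

func recycleMsgBlockBuf(buf []byte) {
	// Drop empty or oversized buffers instead of pooling them,
	// so a single huge block can't keep hundreds of MB pinned.
	if cap(buf) == 0 || cap(buf) > maxPooledBufSize {
		return
	}
	blkPool.Put(buf[:0])
}

func main() {
	b := make([]byte, 32*1024*1024) // e.g. a rebuilt block buffer over the cap
	recycleMsgBlockBuf(b)           // dropped, never retained by the pool
	fmt.Println("pooled buffer cap:", cap(getMsgBlockBuf(4*1024*1024)))
}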

@jzhn
Author

jzhn commented Oct 2, 2024

@kozlovic : my team has reproduced the issue and we have a snapshot of the JetStream data volume available.

Looking at /data/jetstream/$G/streams/natse2e#interest/msgs:

  • It appears to contain contiguous <num>.blk files, ranging from 1265.blk all the way to 21598.blk.
  • Most .blk files are around 4.0M in size; however, there are a few exceptions, including a giant one that is 540M in size.
ls -lah | grep -v 4.0M
total 81G
drwxr-x---. 2 root root 524K Oct  1 23:46 .
drwxr-x---. 4 root root 4.0K Oct  2 16:24 ..
-rw-r-----. 1 root root 6.1M Oct  1 23:46 21585.blk
-rw-r-----. 1 root root 8.8M Oct  1 23:46 21586.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21587.blk
-rw-r-----. 1 root root 8.4M Oct  1 23:46 21588.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21589.blk
-rw-r-----. 1 root root 8.8M Oct  1 23:46 21590.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21591.blk
-rw-r-----. 1 root root 8.4M Oct  1 23:46 21592.blk
-rw-r-----. 1 root root 6.7M Oct  1 23:46 21593.blk
-rw-r-----. 1 root root 7.9M Oct  1 23:46 21594.blk
-rw-r-----. 1 root root 7.6M Oct  1 23:46 21595.blk
-rw-r-----. 1 root root 8.0M Oct  1 23:46 21596.blk
-rw-r-----. 1 root root 4.9M Oct  1 23:46 21597.blk
-rw-r-----. 1 root root 540M Oct  1 23:46 21598.blk
-rw-r-----. 1 root root 608K Oct  1 23:45 index.db

So I suspect the 540M one might be the source of the trouble. If you want, I can dump it somewhere to share with you.

@wallyqs
Member

wallyqs commented Oct 2, 2024

@jzhn can you share the stream config of the natse2e#interest stream?

@jzhn
Author

jzhn commented Oct 2, 2024

Stream

nats stream  info natse2e#interest --json
{
  "config": {
    "name": "natse2e#interest",
    "subjects": [
      "natse2e#interest"
    ],
    "retention": "interest",
    "max_consumers": -1,
    "max_msgs_per_subject": -1,
    "max_msgs": -1,
    "max_bytes": -1,
    "max_age": 0,
    "max_msg_size": -1,
    "storage": "file",
    "discard": "old",
    "num_replicas": 3,
    "duplicate_window": 10000000000,
    "sealed": false,
    "deny_delete": false,
    "deny_purge": false,
    "allow_rollup_hdrs": false,
    "allow_direct": false,
    "mirror_direct": false,
    "consumer_limits": {}
  },
  "created": "2024-09-26T07:28:01.27872814Z",
  "state": {
    "messages": 304306859,
    "bytes": 82085999273,
    "first_seq": 47746067,
    "first_ts": "2024-09-26T22:49:56.351913757Z",
    "last_seq": 352052925,
    "last_ts": "2024-10-01T22:45:35.644775358Z",
    "num_subjects": 1,
    "consumer_count": 1
  },
  "cluster": {
    "name": "kubernetes-nats",
    "leader": "kubernetes-nats-2",
    "replicas": [
      {
        "name": "Server name unknown at this time (peerID: 0EXN8Tyd)",
        "current": false,
        "offline": true,
        "active": 0
      },
      {
        "name": "kubernetes-nats-1",
        "current": true,
        "offline": true,
        "active": 6649434164
      }
    ]
  },
  "ts": "2024-10-02T17:43:02.197390102Z"
}

Consumer

nats consumer info natse2e#interest natse2e-interest-consumer --json
{
  "stream_name": "natse2e#interest",
  "name": "natse2e-interest-consumer",
  "config": {
    "ack_policy": "explicit",
    "ack_wait": 30000000000,
    "deliver_policy": "all",
    "durable_name": "natse2e-interest-consumer",
    "name": "natse2e-interest-consumer",
    "filter_subject": "natse2e#interest",
    "max_ack_pending": 1000,
    "max_deliver": -1,
    "max_waiting": 512,
    "replay_policy": "instant",
    "num_replicas": 0
  },
  "created": "2024-10-01T21:04:26.947979341Z",
  "delivered": {
    "consumer_seq": 47542664,
    "stream_seq": 47746164,
    "last_active": "2024-10-02T17:42:49.801192217Z"
  },
  "ack_floor": {
    "consumer_seq": 47542566,
    "stream_seq": 47746066,
    "last_active": "2024-10-02T17:42:55.559765209Z"
  },
  "num_ack_pending": 98,
  "num_redelivered": 0,
  "num_waiting": 0,
  "num_pending": 304306761,
  "cluster": {
    "name": "kubernetes-nats",
    "leader": "kubernetes-nats-2",
    "replicas": [
      {
        "name": "Server name unknown at this time (peerID: 0EXN8Tyd)",
        "current": false,
        "offline": true,
        "active": 0
      },
      {
        "name": "kubernetes-nats-1",
        "current": true,
        "active": 493106564
      }
    ]
  },
  "ts": "2024-10-02T17:45:29.229527577Z"
}

@derekcollison
Member

Can you share the tarball of that directory?

@jzhn
Author

jzhn commented Oct 3, 2024

@derekcollison : the tarball will be ~70GB in size. I'm generating it now. What's your preferred way to receive this file?

@derekcollison
Member

wormhole or croc

@jzhn
Author

jzhn commented Oct 3, 2024

@derekcollison : I just sent you a URL through email. Please take a look and let us know once you finish downloading. Thanks.

@wallyqs wallyqs changed the title Unrecoverable OOMKill during stream/consumer recovery on startup Unrecoverable OOMKill during stream/consumer recovery on startup [v2.10.20] Oct 8, 2024
derekcollison added a commit that referenced this issue Oct 8, 2024
This fixes that condition and also adds in compaction logic once we know
the tombstones are no longer relevant.

Resolves: #5929  (Why we have large block exceeding maximum block size)

Signed-off-by: Derek Collison <[email protected]>