
Unrecoverable OOMKill during stream/consumer recovery on startup [v2.10.20] #5929

Closed
jzhn opened this issue Sep 25, 2024 · 14 comments · Fixed by #5973
Labels
defect Suspected defect such as a bug or regression

Comments

@jzhn

jzhn commented Sep 25, 2024

Observed behavior

  • NATS pod enters CrashLoopBackOff state as it gets OOMKilled immediately after startup.

  • The debug log shows no useful information; the last lines mention that NATS is starting a restore for the stream.

2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.171931 [INF] Starting nats-server
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172002 [INF]   Version:  2.10.20
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172004 [INF]   Git:      [7140387]
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172006 [DBG]   Go build: go1.22.6
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172007 [INF]   Cluster:  kubernetes-nats
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172009 [INF]   Name:     kubernetes-nats-0
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172010 [INF]   Node:     0EXN8Tyd
2024-09-25 06:17:26.172	[3242] 2024/09/25 13:17:26.172011 [INF]   ID:       NBV5WBW3LR5DERLSSC4NZMNOVEJMM2I52FTO4M7SNZDW6ZAK4Y2PUPOH
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174222 [INF] profiling port: 65432
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174231 [INF] Using configuration file: /etc/nats-config/nats.conf
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174297 [DBG] Created system account: "$SYS"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174623 [INF] Starting http monitor on 0.0.0.0:8222
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174653 [INF] Starting JetStream
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174663 [DBG] JetStream creating dynamic configuration - 0 B memory, 280.00 GB disk
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174750 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174754 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174755 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174756 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174757 [INF] 
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174758 [INF]          https://docs.nats.io/jetstream
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174759 [INF] 
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174760 [INF] ---------------- JETSTREAM ----------------
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174761 [INF]   Max Memory:      0 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174763 [INF]   Max Storage:     280.00 GB
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174764 [INF]   Store Directory: "/data/jetstream"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174765 [INF] -------------------------------------------
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174833 [DBG]   Exports:
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174840 [DBG]      $JS.API.>
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174867 [DBG] Enabled JetStream for account "$G"
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174874 [DBG]   Max Memory:      -1 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174876 [DBG]   Max Storage:     -1 B
2024-09-25 06:17:26.174	[3242] 2024/09/25 13:17:26.174893 [DBG] Recovering JetStream state for account "$G"
2024-09-25 06:17:26.175	[3242] 2024/09/25 13:17:26.175749 [INF]   Starting restore for stream '$G > natse2e#interest'
2024-09-25 06:17:37.324	[3242] 2024/09/25 13:17:37.324591 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.324	[3242] 2024/09/25 13:17:37.324908 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.325	[3242] 2024/09/25 13:17:37.325677 [WRN] Filestore [natse2e#interest] loadBlock error: message block data missing
2024-09-25 06:17:37.326	[3242] 2024/09/25 13:17:37.326397 [INF]   Restored 526,717,218 messages for stream '$G > natse2e#interest' in 11.151s
2024-09-25 06:17:37.326	[3242] 2024/09/25 13:17:37.326428 [INF]   Recovering 1 consumers for stream - '$G > natse2e#interest'
  • The memory profile shows that despite GOMEMLIMIT being set to 819MiB, the nats-server Go process occupied ~1GB of memory, reaching the container memory limit.
    • (*msgBlock).loadBlock and (*msgBlock).rebuildStateLocked each account for ~45% of total memory
(pprof) top10
Showing nodes accounting for 1049.32MB, 98.80% of 1062.01MB total
Dropped 25 nodes (cum <= 5.31MB)
Showing top 10 nodes out of 22
      flat  flat%   sum%        cum   cum%
  476.24MB 44.84% 44.84%   484.24MB 45.60%  github.com/nats-io/nats-server/v2/server.(*msgBlock).loadBlock
  469.94MB 44.25% 89.09%  1030.18MB 97.00%  github.com/nats-io/nats-server/v2/server.(*msgBlock).rebuildStateLocked
      76MB  7.16% 96.25%       76MB  7.16%  github.com/minio/highwayhash.(*digest).Sum

Expected behavior

NATS is able to recover streams and consumers on startup while staying within GOMEMLIMIT.

Server and client version

  • nats-server: nats:2.10.20-alpine
  • nats helm chart: 1.2.4

Host environment

  • Kubernetes statefulset deployment with 3 replicas
  • EBS-backed persistent volume
  • x86-64 CPUs (m7i.2xlarge)
  • container memory limit: 1Gi
  • GOMEMLIMIT: 819MiB (roughly 80% of container memory limit)

Steps to reproduce

  • In a clean, new NATS JetStream cluster, create a new interest-retention, file-based, 3-replica stream.
  • Create a single durable consumer attached to the stream.
  • Publish a large number (~500M) of tiny (~300-byte) messages to the stream while keeping all subscribers of the consumer disabled, so that a large backlog of messages accumulates in the store (a sketch of these steps follows this list).
  • Trigger a crash of the NATS process.
  • Observe that the crashed NATS server is never able to start again; it gets OOMKilled immediately after each restart.
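
For reference, here is a minimal sketch of the first three steps using the nats.go client (github.com/nats-io/nats.go). The server URL and the async-pending limit are placeholders; the stream and consumer names mirror the ones above. This is an illustration of the setup, not the exact tool we used.

package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; point this at the JetStream cluster under test.
	nc, err := nats.Connect("nats://kubernetes-nats:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream(nats.PublishAsyncMaxPending(4096))
	if err != nil {
		panic(err)
	}

	// Interest-retention, file-backed, 3-replica stream (matches the config shown later).
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:      "natse2e#interest",
		Subjects:  []string{"natse2e#interest"},
		Retention: nats.InterestPolicy,
		Storage:   nats.FileStorage,
		Replicas:  3,
	}); err != nil {
		panic(err)
	}

	// Durable pull consumer with no active subscribers, so messages accumulate.
	if _, err := js.AddConsumer("natse2e#interest", &nats.ConsumerConfig{
		Durable:       "natse2e-interest-consumer",
		AckPolicy:     nats.AckExplicitPolicy,
		FilterSubject: "natse2e#interest",
		MaxAckPending: 1000,
	}); err != nil {
		panic(err)
	}

	// Publish ~500M tiny (~300-byte) messages.
	payload := make([]byte, 300)
	for i := 0; i < 500_000_000; i++ {
		if _, err := js.PublishAsync("natse2e#interest", payload); err != nil {
			panic(err)
		}
		if i%100_000 == 0 {
			<-js.PublishAsyncComplete() // apply backpressure and log progress
			fmt.Println("published", i)
		}
	}
	<-js.PublishAsyncComplete()
}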

Notes

  • This issue was reproduced on NATS 2.10.18 and 2.10.20.
  • The only way to get NATS out of the crash loop (without losing messages) is to temporarily assign a higher memory limit to the NATS container.

pprof memory dump

Since the nats process gets killed almost immediately after startup, it took many attempts to capture a memory dump via curl -o mem.pprof http://localhost:65432/debug/pprof/allocs. Given that this dump accounts for pretty much the entire memory limit, it should represent the state just before the OOMKill.
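
For anyone who wants to do the same, this is roughly the approach, sketched as a small Go helper; the profiling port matches the startup log above, and the output file names are arbitrary.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	for i := 0; ; i++ {
		resp, err := http.Get("http://localhost:65432/debug/pprof/allocs")
		if err != nil {
			// Server not up yet, or already OOMKilled; retry shortly.
			time.Sleep(250 * time.Millisecond)
			continue
		}
		name := fmt.Sprintf("mem-%03d.pprof", i)
		out, err := os.Create(name)
		if err != nil {
			resp.Body.Close()
			panic(err)
		}
		_, copyErr := io.Copy(out, resp.Body)
		resp.Body.Close()
		out.Close()
		if copyErr == nil {
			fmt.Println("captured", name)
		}
		time.Sleep(time.Second)
	}
}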

mem.pprof.zip

[pprof allocation graph: pprof001]

@jzhn jzhn added the defect Suspected defect such as a bug or regression label Sep 25, 2024
@derekcollison
Member

Do you properly set GOMEMLIMIT?

@jzhn
Author

jzhn commented Sep 26, 2024

Hi @derekcollison, yes, we've tried setting GOMEMLIMIT to 50%–80% of the container memory limit, but nats-server still crashes on startup. Please take a look at the dump file; it was taken with GOMEMLIMIT=819MiB while the container memory limit was 1Gi. It looks like NATS tries to load some large message blocks into memory during recovery, and interestingly the memory consumption is concentrated in two functions, each taking roughly 45% of the memory limit, so hopefully we can identify an improvement there.

What's not clear to me is how to reproduce this crash condition that forces NATS to recover from such a large message block. My team is still trying to find a deterministic way.

@kozlovic
Member

I checked the profile against a 2.10.20 build using --inuse_space (as opposed to alloc_space).

Based on the errors in the banner and the profile, it looks like the server was trying to load some of the message blocks, but for some of them the load returned an error saying they were not found, which triggered a rebuild of the state.

The memory profile shows that the server is preparing a slice of ~470MB to read a file based on the size indicated by the file stats:

        .          .   5827:	var sz int
         .          .   5828:	if info, err := f.Stat(); err == nil {
         .          .   5829:		sz64 := info.Size()
         .          .   5830:		if int64(int(sz64)) == sz64 {
         .          .   5831:			sz = int(sz64)
         .          .   5832:		} else {
         .          .   5833:			return nil, errMsgBlkTooBig
         .          .   5834:		}
         .          .   5835:	}
         .          .   5836:
         .          .   5837:	if buf == nil {
         .          .   5838:		buf = getMsgBlockBuf(sz)
         .          .   5839:		if sz > cap(buf) {
         .          .   5840:			// We know we will make a new one so just recycle for now.
         .          .   5841:			recycleMsgBlockBuf(buf)
         .          .   5842:			buf = nil
         .          .   5843:		}
         .          .   5844:	}
         .          .   5845:
         .          .   5846:	if sz > cap(buf) {
  476.24MB   476.24MB   5847:		buf = make([]byte, sz)
         .          .   5848:	} else {
         .          .   5849:		buf = buf[:sz]
         .          .   5850:	}

This is really suspicious since by default message blocks are rather small (around 5MB). Could you check the /data/jetstream/$G/streams/<stream name>/msgs directory (based on your startup banner) and list all files present? Do you see anything strange?
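
If it helps, something like this sketch would list any .blk files above the expected block size; the path follows your startup banner and the 8MiB threshold is an assumption, so adjust both as needed.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed store path (from the startup banner) and an assumed 8MiB threshold.
	const maxExpected = 8 * 1024 * 1024
	dir := "/data/jetstream/$G/streams/natse2e#interest/msgs"

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.IsDir() || !strings.HasSuffix(e.Name(), ".blk") {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.Size() > maxExpected {
			fmt.Printf("%s: %d bytes\n", filepath.Join(dir, e.Name()), info.Size())
		}
	}
}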

@neilalexander
Member

The 476.24MB on the inuse_space profile means that there are 476MB total heap allocations originating from that line, not that it's a single allocation of that size.

We still shouldn't be hitting that make() line in ideal circumstances though, so I'm still curious if there are any files in the msgs directory that are larger than 8388608 bytes.

I wonder if we are loading in blocks more quickly than they are being expired in this case, or if we should be recycling the blocks more aggressively but aren't.

@kozlovic
Member

The 476.24MB on the inuse_space profile means that there are 476MB total heap allocations originating from that line, not that it's a single allocation of that size.

I tried filling a server with millions of 300-byte messages and removed the index.db to cause a full restore. When running inuse_space against my server, it reported only ~5MB for this slice. I guess what you are saying is that the buffer should have been reused so it would stay that low?

@jzhn
Author

jzhn commented Sep 27, 2024

Could you check the /data/jetstream/$G/streams/<stream name>/msgs directory (based on your startup banner) and list all files present? Do you see anything strange?

Sorry, we have cleared the persistent storage to fix the cluster so that we could continue with other parts of our testing. We are trying to reproduce the issue again so we can provide more information, and will post it here ASAP.

@kozlovic
Member

@neilalexander But those buffers are put back into a sync.Pool, so we don't really have control over how many there are or how long they stay in the pool. Maybe we need to limit the size of a slice that can be put back into the pool?
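
Roughly what I have in mind, as a sketch only; the 16MiB cap and the bodies of these helpers are hypothetical and not the server's actual implementation (only the helper names come from the code excerpt above).

package main

import (
	"fmt"
	"sync"
)

// Hypothetical cap on what we allow back into the pool.
const maxPooledBufSize = 16 * 1024 * 1024

var blkPool = sync.Pool{
	New: func() any { return []byte(nil) },
}

func getMsgBlockBuf(sz int) []byte {
	buf := blkPool.Get().([]byte)
	if cap(buf) >= sz {
		return buf[:sz]
	}
	// Too small; caller will allocate a fresh slice of the right size.
	return buf[:0]
}

func recycleMsgBlockBuf(buf []byte) {
	// Drop empty or oversized buffers instead of pooling them,
	// so a single huge block can't keep hundreds of MB pinned.
	if cap(buf) == 0 || cap(buf) > maxPooledBufSize {
		return
	}
	blkPool.Put(buf[:0])
}

func main() {
	b := make([]byte, 32*1024*1024) // e.g. a rebuilt block buffer over the cap
	recycleMsgBlockBuf(b)           // dropped, never retained by the pool
	fmt.Println("pooled buffer cap:", cap(getMsgBlockBuf(4*1024*1024)))
}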

@jzhn
Author

jzhn commented Oct 2, 2024

@kozlovic : my team has reproduced the issue and we have a snapshot of the JetStream data volume available.

Looking at /data/jetstream/$G/streams/natse2e#interest/msgs:

  • It appears to contain contiguous <num>.blk files, ranging from 1265.blk all the way to 21598.blk.
  • Most .blk files are around 4.0M in size; however, there are a few exceptions, including a giant one that is 540M in size.
ls -lah | grep -v 4.0M
total 81G
drwxr-x---. 2 root root 524K Oct  1 23:46 .
drwxr-x---. 4 root root 4.0K Oct  2 16:24 ..
-rw-r-----. 1 root root 6.1M Oct  1 23:46 21585.blk
-rw-r-----. 1 root root 8.8M Oct  1 23:46 21586.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21587.blk
-rw-r-----. 1 root root 8.4M Oct  1 23:46 21588.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21589.blk
-rw-r-----. 1 root root 8.8M Oct  1 23:46 21590.blk
-rw-r-----. 1 root root 8.9M Oct  1 23:46 21591.blk
-rw-r-----. 1 root root 8.4M Oct  1 23:46 21592.blk
-rw-r-----. 1 root root 6.7M Oct  1 23:46 21593.blk
-rw-r-----. 1 root root 7.9M Oct  1 23:46 21594.blk
-rw-r-----. 1 root root 7.6M Oct  1 23:46 21595.blk
-rw-r-----. 1 root root 8.0M Oct  1 23:46 21596.blk
-rw-r-----. 1 root root 4.9M Oct  1 23:46 21597.blk
-rw-r-----. 1 root root 540M Oct  1 23:46 21598.blk
-rw-r-----. 1 root root 608K Oct  1 23:45 index.db

So I suspect the 540M one might be the source of the trouble. If you want, I can dump it somewhere to share with you.

@wallyqs
Member

wallyqs commented Oct 2, 2024

@jzhn can you share the stream config of the natse2e#interest stream?

@jzhn
Author

jzhn commented Oct 2, 2024

Stream

nats stream  info natse2e#interest --json
{
  "config": {
    "name": "natse2e#interest",
    "subjects": [
      "natse2e#interest"
    ],
    "retention": "interest",
    "max_consumers": -1,
    "max_msgs_per_subject": -1,
    "max_msgs": -1,
    "max_bytes": -1,
    "max_age": 0,
    "max_msg_size": -1,
    "storage": "file",
    "discard": "old",
    "num_replicas": 3,
    "duplicate_window": 10000000000,
    "sealed": false,
    "deny_delete": false,
    "deny_purge": false,
    "allow_rollup_hdrs": false,
    "allow_direct": false,
    "mirror_direct": false,
    "consumer_limits": {}
  },
  "created": "2024-09-26T07:28:01.27872814Z",
  "state": {
    "messages": 304306859,
    "bytes": 82085999273,
    "first_seq": 47746067,
    "first_ts": "2024-09-26T22:49:56.351913757Z",
    "last_seq": 352052925,
    "last_ts": "2024-10-01T22:45:35.644775358Z",
    "num_subjects": 1,
    "consumer_count": 1
  },
  "cluster": {
    "name": "kubernetes-nats",
    "leader": "kubernetes-nats-2",
    "replicas": [
      {
        "name": "Server name unknown at this time (peerID: 0EXN8Tyd)",
        "current": false,
        "offline": true,
        "active": 0
      },
      {
        "name": "kubernetes-nats-1",
        "current": true,
        "offline": true,
        "active": 6649434164
      }
    ]
  },
  "ts": "2024-10-02T17:43:02.197390102Z"
}

Consumer

nats consumer info natse2e#interest natse2e-interest-consumer --json
{
  "stream_name": "natse2e#interest",
  "name": "natse2e-interest-consumer",
  "config": {
    "ack_policy": "explicit",
    "ack_wait": 30000000000,
    "deliver_policy": "all",
    "durable_name": "natse2e-interest-consumer",
    "name": "natse2e-interest-consumer",
    "filter_subject": "natse2e#interest",
    "max_ack_pending": 1000,
    "max_deliver": -1,
    "max_waiting": 512,
    "replay_policy": "instant",
    "num_replicas": 0
  },
  "created": "2024-10-01T21:04:26.947979341Z",
  "delivered": {
    "consumer_seq": 47542664,
    "stream_seq": 47746164,
    "last_active": "2024-10-02T17:42:49.801192217Z"
  },
  "ack_floor": {
    "consumer_seq": 47542566,
    "stream_seq": 47746066,
    "last_active": "2024-10-02T17:42:55.559765209Z"
  },
  "num_ack_pending": 98,
  "num_redelivered": 0,
  "num_waiting": 0,
  "num_pending": 304306761,
  "cluster": {
    "name": "kubernetes-nats",
    "leader": "kubernetes-nats-2",
    "replicas": [
      {
        "name": "Server name unknown at this time (peerID: 0EXN8Tyd)",
        "current": false,
        "offline": true,
        "active": 0
      },
      {
        "name": "kubernetes-nats-1",
        "current": true,
        "active": 493106564
      }
    ]
  },
  "ts": "2024-10-02T17:45:29.229527577Z"
}

@derekcollison
Member

Can you share the tarball of that directory?

@jzhn
Author

jzhn commented Oct 3, 2024

@derekcollison : the tarball will be ~70GB in size. I'm generating it now. What's your preferred way to receive this file?

@derekcollison
Member

wormhole or croc

@jzhn
Author

jzhn commented Oct 3, 2024

@derekcollison : I just sent you a URL through email. Please take a look and let us know once you finish downloading. Thanks.

@wallyqs wallyqs changed the title Unrecoverable OOMKill during stream/consumer recovery on startup Unrecoverable OOMKill during stream/consumer recovery on startup [v2.10.20] Oct 8, 2024
derekcollison added a commit that referenced this issue Oct 8, 2024
This fixes that condition and also adds in compaction logic once we know
the tombstones are no longer relevant.

Resolves: #5929  (Why we have large block exceeding maximum block size)

Signed-off-by: Derek Collison <[email protected]>