Jira Link: DB-1006
Observed this issue in a TPC-C 100k-warehouse run, where one of the client threads failed while loading the data.
It's a 63-node cluster with 3 master nodes and 60 tserver nodes.
40 clients were loading the 100k-warehouse TPC-C data set (each client loading 2,500 warehouses, so 40 × 2,500 = 100,000 in total).
DB version for the cluster: 2.9.2.0_b104
Flags:
memstore_size_mb 256
global_memstore_size_mb_max 4096
tserver_tcmalloc_max_total_thread_cache_bytes 1073741824
rocksdb_max_write_buffer_number 10
The load failed with a "This connection has been closed" error, at around 12:36:50.
Error in the client:
Exception in thread "main" java.lang.RuntimeException: Failed to execute threads: org.postgresql.util.PSQLException: This connection has been closed.
at com.oltpbenchmark.util.ThreadUtil.runNewPool(ThreadUtil.java:81)
at com.oltpbenchmark.api.BenchmarkModule.loadDatabase(BenchmarkModule.java:221)
at com.oltpbenchmark.DBWorkload.runLoader(DBWorkload.java:458)
at com.oltpbenchmark.DBWorkload.main(DBWorkload.java:358)
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: This connection has been closed.
at com.oltpbenchmark.api.Loader$LoaderThread.run(Loader.java:70)
at com.oltpbenchmark.util.ThreadUtil$LatchRunnable.run(ThreadUtil.java:117)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:793)
at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:779)
at com.oltpbenchmark.api.Loader$LoaderThread.run(Loader.java:63)
... 4 more
Checked the logs of one of the tservers (172.151.58.196); it had the following message in its warning log file:
W1117 12:36:49.463980 35966 tablet_rpc.cc:450] Timed out (yb/rpc/outbound_call.cc:501): Failed UpdateTransaction: tablet_id: "ab44776dbdf14673a80f802529e32fc8" state { transaction_id: "\t_\257qz\213M\222\226|ic\321\304\001?" status: APPLYING tablets: "f29f0b0fba4d4320b028c7bfb875c42a" commit_hybrid_time: 6705777047196700672 sealed: false aborted { } } propagated_hybrid_time: 6705777067885903872, retrier: { task_id: -1 state: kIdle deadline: 7776.420s } to tablet ab44776dbdf14673a80f802529e32fc8 on tablet server { uuid: 6be590b5174b4a31967588d433a9bf19 private: [host: "172.151.48.222" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2c" after 1 attempt(s): UpdateTransaction RPC (request call id 101668698) to 172.151.48.222:9100 timed out after 4.999s
On the server 172.151.48.222 there are no errors, but its warning log has the following messages, which indicate a crash:
W1117 12:37:01.903275 47301 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-e7090f7b29da4a608f40b238835dbd3a/wal-000000022: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
W1117 12:37:01.903398 47301 log_reader.cc:197] T e7090f7b29da4a608f40b238835dbd3a P 6be590b5174b4a31967588d433a9bf19: Log segment /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-e7090f7b29da4a608f40b238835dbd3a/wal-000000022 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W1117 12:37:01.903419 47297 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-609afb43a2374b388ec87d931f35b6d9/wal-000000021: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
W1117 12:37:01.903276 47303 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-ce7e3ec6df0942b59513366e0580ca1e/wal-000000024: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
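These warnings mean the newest WAL segments have no closing footer, which is expected when the process dies while a segment is still being appended to; as the message says, the reader then rebuilds the footer by scanning the entries. Purely for context, the shape of such a footer-magic check is sketched below. This is an illustrative approximation, not YugabyteDB's actual log_util code; kFooterSize, kFooterMagic, and ReadFooter are made-up names.
#include <array>
#include <cstring>
#include <fstream>
#include <string>
// Illustrative sketch only: a cleanly closed WAL segment ends with a footer
// whose tail contains a known magic constant; an in-progress segment does not,
// so the reader reports "Footer not found / magic doesn't match" and falls
// back to scanning the segment. Footer size and magic are placeholders.
constexpr std::size_t kFooterSize = 16;
constexpr char kFooterMagic[8] = {'W', 'A', 'L', 'F', 'O', 'O', 'T', 'R'};
enum class FooterStatus { kOk, kNotFound };
FooterStatus ReadFooter(const std::string& segment_path) {
  std::ifstream in(segment_path, std::ios::binary | std::ios::ate);
  const std::streamoff file_size = in.tellg();
  if (!in || file_size < static_cast<std::streamoff>(kFooterSize)) {
    return FooterStatus::kNotFound;  // segment too short to hold a footer
  }
  std::array<char, kFooterSize> footer;
  in.seekg(file_size - static_cast<std::streamoff>(kFooterSize));
  in.read(footer.data(), footer.size());
  // For a segment left open by a crash, the last bytes are entry data (or
  // zeros) rather than the magic, so this comparison fails.
  if (!in || std::memcmp(footer.data() + kFooterSize - sizeof(kFooterMagic),
                         kFooterMagic, sizeof(kFooterMagic)) != 0) {
    return FooterStatus::kNotFound;
  }
  return FooterStatus::kOk;
}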
This tserver (172.151.48.222) had crashed and was restarted, which lines up with the UpdateTransaction RPC to it timing out at 12:36:49 on 172.151.58.196 and with the client's connection-closed error at around 12:36:50. The backtrace from the core dump is below.
#0 rocksdb::Block::NumRestarts (this=this@entry=0x6ed06960) at ../../src/yb/rocksdb/table/block.cc:491
491 ../../src/yb/rocksdb/table/block.cc: No such file or directory.
(gdb) bt
#0 rocksdb::Block::NumRestarts (this=this@entry=0x6ed06960) at ../../src/yb/rocksdb/table/block.cc:491
#1 0x00007f78d36f2b82 in rocksdb::Block::NewIterator (this=0x6ed06960, cmp=0x18346e50, key_value_encoding_format=rocksdb::kKeyDeltaEncodingSharedPrefix, iter=iter@entry=0x0, total_order_seek=total_order_seek@entry=true)
at ../../src/yb/rocksdb/table/block.cc:522
#2 0x00007f78d36ea92f in rocksdb::BlockBasedTable::NewDataBlockIterator (this=0x192cf6cb0, ro=..., index_value=..., block_type=rocksdb::kIndex, input_iter=0x0) at ../../src/yb/rocksdb/table/block_based_table_reader.cc:1256
#3 0x00007f78d3704dde in InitSubIterator (parent_iter=0x31477e4c8, this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:271
#4 DoSeek<std::_Bind<std::_Mem_fn<void (rocksdb::IteratorWrapper::*)()>(std::_Placeholder<1>)> > (seek_function=..., this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:231
#5 rocksdb::MultiLevelIterator::SeekToFirst (this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:165
#6 0x00007f78d372b816 in SeekToFirst (this=0x8028f110) at ../../src/yb/rocksdb/table/iterator_wrapper.h:127
#7 rocksdb::(anonymous namespace)::TwoLevelIterator::SeekToFirst (this=0x8028f0e0) at ../../src/yb/rocksdb/table/two_level_iterator.cc:130
#8 0x00007f78d3709595 in SeekToFirst (this=<optimized out>) at ../../src/yb/rocksdb/table/iterator_wrapper.h:127
#9 rocksdb::MergingIterator::SeekToFirst (this=0x5151f200) at ../../src/yb/rocksdb/table/merger.cc:97
#10 0x00007f78d35f3c9d in rocksdb::CompactionJob::ProcessKeyValueCompaction (this=this@entry=0x7f7589192d70, holder=holder@entry=0x7f7589192930, sub_compact=0xaae634a0) at ../../src/yb/rocksdb/db/compaction_job.cc:631
#11 0x00007f78d35f509a in rocksdb::CompactionJob::Run (this=this@entry=0x7f7589192d70) at ../../src/yb/rocksdb/db/compaction_job.cc:461
#12 0x00007f78d3631a32 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x17ff1200, made_progress=made_progress@entry=0x7f75891931e7, job_context=job_context@entry=0x7f7589193250, log_buffer=log_buffer@entry=0x7f7589193420,
manual_compaction=manual_compaction@entry=0x0, compaction=...) at ../../src/yb/rocksdb/db/db_impl.cc:3626
#13 0x00007f78d36335f7 in rocksdb::DBImpl::BackgroundCallCompaction (this=0x17ff1200, m=0x0, compaction=..., compaction_task=0xd1dd7e80) at ../../src/yb/rocksdb/db/db_impl.cc:3351
#14 0x00007f78d36470c8 in rocksdb::DBImpl::CompactionTask::DoRun (this=<optimized out>, suspender=<optimized out>) at ../../src/yb/rocksdb/db/db_impl.cc:265
#15 0x00007f78d3639a79 in rocksdb::DBImpl::ThreadPoolTask::Run (this=0xd1dd7e80, status=..., suspender=<optimized out>) at ../../src/yb/rocksdb/db/db_impl.cc:219
#16 0x00007f78d024c868 in yb::(anonymous namespace)::PriorityThreadPoolWorker::Run (this=0x2417bb98) at ../../src/yb/util/priority_thread_pool.cc:213
#17 0x00007f78d027ff2f in operator() (this=0x3f041eb8) at /home/yugabyte/yb-software/yugabyte-2.9.2.0-b104-centos-x86_64/linuxbrew-xxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#18 yb::Thread::SuperviseThread (arg=0x3f041e60) at ../../src/yb/util/thread.cc:771
#19 0x00007f78ca51c694 in start_thread (arg=0x7f758919c700) at pthread_create.c:333
#20 0x00007f78c9c5941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
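The faulting frame is rocksdb::Block::NumRestarts(), which in the RocksDB block format reads the restart count stored in the last four bytes of the block. A minimal sketch of that read is below, based on the upstream RocksDB block layout (YugabyteDB's fork may differ in details; DecodeFixed32 and NumRestarts here are simplified stand-ins, not the actual implementation). If the block's data pointer or size is bad, for example because the block contents read from the SST file were corrupted or truncated, this trailing-word access is exactly where a segfault would surface during the compaction's SeekToFirst():
#include <cstddef>
#include <cstdint>
#include <cstring>
// Little-endian fixed32 decode, in the style of RocksDB's coding helpers.
inline std::uint32_t DecodeFixed32(const char* ptr) {
  std::uint32_t result;
  std::memcpy(&result, ptr, sizeof(result));
  return result;
}
// Simplified stand-in for rocksdb::Block::NumRestarts(). Upstream block layout:
//   [entries ...][restart_offset_0 .. restart_offset_{n-1}][num_restarts]
// The restart count is the trailing fixed32 of the block contents, so this
// dereferences data + size - 4. With a corrupted or truncated block (bogus
// size, or data pointing at garbage), that read can fault, matching frame #0
// of the backtrace above.
inline std::uint32_t NumRestarts(const char* data, std::size_t size) {
  return DecodeFixed32(data + size - sizeof(std::uint32_t));
}
Frames #10 through #15 show this happened on a background compaction thread, so the fault occurred on RocksDB's compaction path rather than while directly serving the client's writes.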