Jira Link: DB-1006
Observed this issue in a TPC-C 100k-warehouse run, where one of the client threads failed while loading the data.
It's a 63-node cluster with 3 master nodes and 60 tserver nodes.
40 clients were loading the 100k-warehouse TPC-C data set (each client loading 2,500 warehouses, so 40 × 2,500 = 100,000 in total).
DB version for the cluster: 2.9.2.0_b104
Flags:
memstore_size_mb 256
global_memstore_size_mb_max 4096
tserver_tcmalloc_max_total_thread_cache_bytes 1073741824
rocksdb_max_write_buffer_number 10
The load failed with a "This connection has been closed" error, at around 12:36:50.
Error in the client:
Exception in thread "main" java.lang.RuntimeException: Failed to execute threads: org.postgresql.util.PSQLException: This connection has been closed.
at com.oltpbenchmark.util.ThreadUtil.runNewPool(ThreadUtil.java:81)
at com.oltpbenchmark.api.BenchmarkModule.loadDatabase(BenchmarkModule.java:221)
at com.oltpbenchmark.DBWorkload.runLoader(DBWorkload.java:458)
at com.oltpbenchmark.DBWorkload.main(DBWorkload.java:358)
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: This connection has been closed.
at com.oltpbenchmark.api.Loader$LoaderThread.run(Loader.java:70)
at com.oltpbenchmark.util.ThreadUtil$LatchRunnable.run(ThreadUtil.java:117)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:793)
at org.postgresql.jdbc.PgConnection.commit(PgConnection.java:779)
at com.oltpbenchmark.api.Loader$LoaderThread.run(Loader.java:63)
... 4 more
Checked the logs of one of the tservers (172.151.58.196); it had the following message in its warning log file:
W1117 12:36:49.463980 35966 tablet_rpc.cc:450] Timed out (yb/rpc/outbound_call.cc:501): Failed UpdateTransaction: tablet_id: "ab44776dbdf14673a80f802529e32fc8" state { transaction_id: "\t_\257qz\213M\222\226|ic\321\304\001?" status: APPLYING tablets: "f29f0b0fba4d4320b028c7bfb875c42a" commit_hybrid_time: 6705777047196700672 sealed: false aborted { } } propagated_hybrid_time: 6705777067885903872, retrier: { task_id: -1 state: kIdle deadline: 7776.420s } to tablet ab44776dbdf14673a80f802529e32fc8 on tablet server { uuid: 6be590b5174b4a31967588d433a9bf19 private: [host: "172.151.48.222" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2c" after 1 attempt(s): UpdateTransaction RPC (request call id 101668698) to 172.151.48.222:9100 timed out after 4.999s
On the server 172.151.48.222 there are no errors, but its warning log has the following messages, which indicate a crash:
W1117 12:37:01.903275 47301 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-e7090f7b29da4a608f40b238835dbd3a/wal-000000022: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
W1117 12:37:01.903398 47301 log_reader.cc:197] T e7090f7b29da4a608f40b238835dbd3a P 6be590b5174b4a31967588d433a9bf19: Log segment /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-e7090f7b29da4a608f40b238835dbd3a/wal-000000022 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W1117 12:37:01.903419 47297 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-609afb43a2374b388ec87d931f35b6d9/wal-000000021: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
W1117 12:37:01.903276 47303 log_util.cc:221] Could not read footer for segment: /mnt/d1/yb-data/tserver/wals/table-fde6b822741e4ab890209eb89a2c9543/tablet-ce7e3ec6df0942b59513366e0580ca1e/wal-000000024: Not found (yb/consensus/log_util.cc:463): Footer not found. Footer magic doesn't match
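These warnings mean the newest WAL segments have no closing footer, which is expected when the process dies while a segment is still being appended to; as the message says, the reader then rebuilds the footer by scanning the entries. Purely for context, the shape of such a footer-magic check is sketched below. This is an illustrative approximation, not YugabyteDB's actual log_util code; kFooterSize, kFooterMagic, and ReadFooter are made-up names.
#include <array>
#include <cstring>
#include <fstream>
#include <string>
// Illustrative sketch only: a cleanly closed WAL segment ends with a footer
// whose tail contains a known magic constant; an in-progress segment does not,
// so the reader reports "Footer not found / magic doesn't match" and falls
// back to scanning the segment. Footer size and magic are placeholders.
constexpr std::size_t kFooterSize = 16;
constexpr char kFooterMagic[8] = {'W', 'A', 'L', 'F', 'O', 'O', 'T', 'R'};
enum class FooterStatus { kOk, kNotFound };
FooterStatus ReadFooter(const std::string& segment_path) {
  std::ifstream in(segment_path, std::ios::binary | std::ios::ate);
  const std::streamoff file_size = in.tellg();
  if (!in || file_size < static_cast<std::streamoff>(kFooterSize)) {
    return FooterStatus::kNotFound;  // segment too short to hold a footer
  }
  std::array<char, kFooterSize> footer;
  in.seekg(file_size - static_cast<std::streamoff>(kFooterSize));
  in.read(footer.data(), footer.size());
  // For a segment left open by a crash, the last bytes are entry data (or
  // zeros) rather than the magic, so this comparison fails.
  if (!in || std::memcmp(footer.data() + kFooterSize - sizeof(kFooterMagic),
                         kFooterMagic, sizeof(kFooterMagic)) != 0) {
    return FooterStatus::kNotFound;
  }
  return FooterStatus::kOk;
}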
This tserver (172.151.48.222) had crashed and was restarted, which lines up with the UpdateTransaction RPC to it timing out at 12:36:49 on 172.151.58.196 and with the client's connection-closed error at around 12:36:50. The backtrace from the core dump is below.
#0 rocksdb::Block::NumRestarts (this=this@entry=0x6ed06960) at ../../src/yb/rocksdb/table/block.cc:491
491 ../../src/yb/rocksdb/table/block.cc: No such file or directory.
(gdb) bt
#0 rocksdb::Block::NumRestarts (this=this@entry=0x6ed06960) at ../../src/yb/rocksdb/table/block.cc:491
#1 0x00007f78d36f2b82 in rocksdb::Block::NewIterator (this=0x6ed06960, cmp=0x18346e50, key_value_encoding_format=rocksdb::kKeyDeltaEncodingSharedPrefix, iter=iter@entry=0x0, total_order_seek=total_order_seek@entry=true)
at ../../src/yb/rocksdb/table/block.cc:522
#2 0x00007f78d36ea92f in rocksdb::BlockBasedTable::NewDataBlockIterator (this=0x192cf6cb0, ro=..., index_value=..., block_type=rocksdb::kIndex, input_iter=0x0) at ../../src/yb/rocksdb/table/block_based_table_reader.cc:1256
#3 0x00007f78d3704dde in InitSubIterator (parent_iter=0x31477e4c8, this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:271
#4 DoSeek<std::_Bind<std::_Mem_fn<void (rocksdb::IteratorWrapper::*)()>(std::_Placeholder<1>)> > (seek_function=..., this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:231
#5 rocksdb::MultiLevelIterator::SeekToFirst (this=0x31477e480) at ../../src/yb/rocksdb/table/index_reader.cc:165
#6 0x00007f78d372b816 in SeekToFirst (this=0x8028f110) at ../../src/yb/rocksdb/table/iterator_wrapper.h:127
#7 rocksdb::(anonymous namespace)::TwoLevelIterator::SeekToFirst (this=0x8028f0e0) at ../../src/yb/rocksdb/table/two_level_iterator.cc:130
#8 0x00007f78d3709595 in SeekToFirst (this=<optimized out>) at ../../src/yb/rocksdb/table/iterator_wrapper.h:127
#9 rocksdb::MergingIterator::SeekToFirst (this=0x5151f200) at ../../src/yb/rocksdb/table/merger.cc:97
#10 0x00007f78d35f3c9d in rocksdb::CompactionJob::ProcessKeyValueCompaction (this=this@entry=0x7f7589192d70, holder=holder@entry=0x7f7589192930, sub_compact=0xaae634a0) at ../../src/yb/rocksdb/db/compaction_job.cc:631
#11 0x00007f78d35f509a in rocksdb::CompactionJob::Run (this=this@entry=0x7f7589192d70) at ../../src/yb/rocksdb/db/compaction_job.cc:461
#12 0x00007f78d3631a32 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x17ff1200, made_progress=made_progress@entry=0x7f75891931e7, job_context=job_context@entry=0x7f7589193250, log_buffer=log_buffer@entry=0x7f7589193420,
manual_compaction=manual_compaction@entry=0x0, compaction=...) at ../../src/yb/rocksdb/db/db_impl.cc:3626
#13 0x00007f78d36335f7 in rocksdb::DBImpl::BackgroundCallCompaction (this=0x17ff1200, m=0x0, compaction=..., compaction_task=0xd1dd7e80) at ../../src/yb/rocksdb/db/db_impl.cc:3351
#14 0x00007f78d36470c8 in rocksdb::DBImpl::CompactionTask::DoRun (this=<optimized out>, suspender=<optimized out>) at ../../src/yb/rocksdb/db/db_impl.cc:265
#15 0x00007f78d3639a79 in rocksdb::DBImpl::ThreadPoolTask::Run (this=0xd1dd7e80, status=..., suspender=<optimized out>) at ../../src/yb/rocksdb/db/db_impl.cc:219
#16 0x00007f78d024c868 in yb::(anonymous namespace)::PriorityThreadPoolWorker::Run (this=0x2417bb98) at ../../src/yb/util/priority_thread_pool.cc:213
#17 0x00007f78d027ff2f in operator() (this=0x3f041eb8) at /home/yugabyte/yb-software/yugabyte-2.9.2.0-b104-centos-x86_64/linuxbrew-xxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#18 yb::Thread::SuperviseThread (arg=0x3f041e60) at ../../src/yb/util/thread.cc:771
#19 0x00007f78ca51c694 in start_thread (arg=0x7f758919c700) at pthread_create.c:333
#20 0x00007f78c9c5941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
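The faulting frame is rocksdb::Block::NumRestarts(), which in the RocksDB block format reads the restart count stored in the last four bytes of the block. A minimal sketch of that read is below, based on the upstream RocksDB block layout (YugabyteDB's fork may differ in details; DecodeFixed32 and NumRestarts here are simplified stand-ins, not the actual implementation). If the block's data pointer or size is bad, for example because the block contents read from the SST file were corrupted or truncated, this trailing-word access is exactly where a segfault would surface during the compaction's SeekToFirst():
#include <cstddef>
#include <cstdint>
#include <cstring>
// Little-endian fixed32 decode, in the style of RocksDB's coding helpers.
inline std::uint32_t DecodeFixed32(const char* ptr) {
  std::uint32_t result;
  std::memcpy(&result, ptr, sizeof(result));
  return result;
}
// Simplified stand-in for rocksdb::Block::NumRestarts(). Upstream block layout:
//   [entries ...][restart_offset_0 .. restart_offset_{n-1}][num_restarts]
// The restart count is the trailing fixed32 of the block contents, so this
// dereferences data + size - 4. With a corrupted or truncated block (bogus
// size, or data pointing at garbage), that read can fault, matching frame #0
// of the backtrace above.
inline std::uint32_t NumRestarts(const char* data, std::size_t size) {
  return DecodeFixed32(data + size - sizeof(std::uint32_t));
}
Frames #10 through #15 show this happened on a background compaction thread, so the fault occurred on RocksDB's compaction path rather than while directly serving the client's writes.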