Fix possible (likely distributed) query hang #45448
Merged
Conversation
Signed-off-by: Azat Khuzhin <[email protected]>
Since we still want to join the thread: yes, it will be done in the dtor, but this looks better. Signed-off-by: Azat Khuzhin <[email protected]>
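As a hypothetical illustration of this commit message (not the actual `PullingAsyncPipelineExecutor::Data` code), the destructor is the backstop that always waits for the worker thread, even if the explicit cancel/join path bailed out early:

```cpp
#include <thread>

// Hypothetical stand-in for the executor's internal Data struct.
struct Data
{
    std::thread thread;

    ~Data()
    {
        // Backstop: the worker is always joined on destruction,
        // even when an explicit cancel()/join path was skipped.
        if (thread.joinable())
            thread.join();
    }
};
```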
Recently I saw the following: a client executed a long distributed query and then terminated the connection. In this case query cancellation is done from the PullingAsyncPipelineExecutor dtor, but during cancellation one of the nodes sent ECONNRESET, which caused an exception from PullingAsyncPipelineExecutor::cancel(). This in turn caused a deadlock in which multiple threads wait for each other, because cancel() for LazyOutputFormat was never called.

Here is the relevant portion of the logs:

```
2023.01.04 08:26:09.236208 [ 37968 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Debug> executeQuery: (from 10.61.13.253:44266, user: default) TooLongDistributedQueryToPost
...
2023.01.04 08:26:09.262424 [ 37968 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Trace> MergeTreeInOrderSelectProcessor: Reading 1 ranges in order from part 9_330_538_18, approx. 61440 rows starting from 0
2023.01.04 08:26:09.266399 [ 26788 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Trace> Connection (s4.ch:9000): Connecting. Database: (not specified). User: default
2023.01.04 08:26:09.266849 [ 26788 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Trace> Connection (s4.ch:9000): Connected to ClickHouse server version 22.10.1.
2023.01.04 08:26:09.267165 [ 26788 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Debug> Connection (s4.ch:9000): Sent data for 2 scalars, total 2 rows in 3.1587e-05 sec., 62635 rows/sec., 68.00 B (2.03 MiB/sec.), compressed 0.4594594594594595 times to 148.00 B (4.41 MiB/sec.)
2023.01.04 08:39:13.047170 [ 37968 ] {f2ed6149-146d-4a3d-874a-b0b751c7b567} <Error> PullingAsyncPipelineExecutor: Code: 210. DB::NetException: Connection reset by peer, while writing to socket (10.7.142.115:9000). (NETWORK_ERROR), Stack trace (when copying this message, always include the lines below):

0. ./.build/./contrib/libcxx/include/exception:133: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x1818234c in /usr/lib/debug/usr/bin/clickhouse.debug
1. ./.build/./src/Common/Exception.cpp:69: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x1004fbda in /usr/lib/debug/usr/bin/clickhouse.debug
2. ./.build/./src/Common/NetException.h:12: DB::WriteBufferFromPocoSocket::nextImpl() @ 0x14e352f3 in /usr/lib/debug/usr/bin/clickhouse.debug
3. ./.build/./src/IO/BufferBase.h:39: DB::Connection::sendCancel() @ 0x15c21e6b in /usr/lib/debug/usr/bin/clickhouse.debug
4. ./.build/./src/Client/MultiplexedConnections.cpp:0: DB::MultiplexedConnections::sendCancel() @ 0x15c4d5b7 in /usr/lib/debug/usr/bin/clickhouse.debug
5. ./.build/./src/QueryPipeline/RemoteQueryExecutor.cpp:627: DB::RemoteQueryExecutor::tryCancel(char const*, std::__1::unique_ptr<DB::RemoteQueryExecutorReadContext, std::__1::default_delete<DB::RemoteQueryExecutorReadContext> >*) @ 0x14446c09 in /usr/lib/debug/usr/bin/clickhouse.debug
6. ./.build/./contrib/libcxx/include/__iterator/wrap_iter.h:100: DB::ExecutingGraph::cancel() @ 0x15d2c0de in /usr/lib/debug/usr/bin/clickhouse.debug
7. ./.build/./contrib/libcxx/include/__memory/unique_ptr.h:300: DB::PullingAsyncPipelineExecutor::cancel() @ 0x15d32055 in /usr/lib/debug/usr/bin/clickhouse.debug
8. ./.build/./contrib/libcxx/include/__memory/unique_ptr.h:312: DB::PullingAsyncPipelineExecutor::~PullingAsyncPipelineExecutor() @ 0x15d31f4f in /usr/lib/debug/usr/bin/clickhouse.debug
9. ./.build/./src/Server/TCPHandler.cpp:0: DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x15cde919 in /usr/lib/debug/usr/bin/clickhouse.debug
10. ./.build/./src/Server/TCPHandler.cpp:0: DB::TCPHandler::runImpl() @ 0x15cd8554 in /usr/lib/debug/usr/bin/clickhouse.debug
11. ./.build/./src/Server/TCPHandler.cpp:1904: DB::TCPHandler::run() @ 0x15ce6479 in /usr/lib/debug/usr/bin/clickhouse.debug
12. ./.build/./contrib/poco/Net/src/TCPServerConnection.cpp:57: Poco::Net::TCPServerConnection::start() @ 0x18074f07 in /usr/lib/debug/usr/bin/clickhouse.debug
13. ./.build/./contrib/libcxx/include/__memory/unique_ptr.h:54: Poco::Net::TCPServerDispatcher::run() @ 0x180753ed in /usr/lib/debug/usr/bin/clickhouse.debug
14. ./.build/./contrib/poco/Foundation/src/ThreadPool.cpp:213: Poco::PooledThread::run() @ 0x181e3807 in /usr/lib/debug/usr/bin/clickhouse.debug
15. ./.build/./contrib/poco/Foundation/include/Poco/SharedPtr.h:156: Poco::ThreadImpl::runnableEntry(void*) @ 0x181e1483 in /usr/lib/debug/usr/bin/clickhouse.debug
16. ? @ 0x7ffff7e55fd4 in ?
17. ? @ 0x7ffff7ed666c in ?
(version 22.10.1.1)
```

And here is the state of the threads:

<details>
<summary>system.stack_trace</summary>

```sql
SELECT arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n') AS sym
FROM system.stack_trace
WHERE query_id = 'f2ed6149-146d-4a3d-874a-b0b751c7b567'
SETTINGS allow_introspection_functions=1

Row 1:
──────
sym: pthread_cond_wait
std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
bool ConcurrentBoundedQueue<DB::Chunk>::emplaceImpl<DB::Chunk>(std::__1::optional<unsigned long>, DB::Chunk&&)
DB::IOutputFormat::work()
DB::ExecutionThreadContext::executeTask()
DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*)

Row 2:
──────
sym: pthread_cond_wait
Poco::EventImpl::waitImpl()
DB::PipelineExecutor::joinThreads()
DB::PipelineExecutor::executeImpl(unsigned long)
DB::PipelineExecutor::execute(unsigned long)

Row 3:
──────
sym: pthread_cond_wait
Poco::EventImpl::waitImpl()
DB::PullingAsyncPipelineExecutor::Data::~Data()
DB::PullingAsyncPipelineExecutor::~PullingAsyncPipelineExecutor()
DB::TCPHandler::processOrdinaryQueryWithProcessors()
DB::TCPHandler::runImpl()
DB::TCPHandler::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
```

</details>

Signed-off-by: Azat Khuzhin <[email protected]>
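The shape of the fix follows from the trace above: `PullingAsyncPipelineExecutor::cancel()` must not let an exception from cancelling the executing graph (frames 3–7, ending in `Connection::sendCancel()`) escape before the output format is cancelled. Below is a minimal sketch of that idea; the member names (`data`, `lazy_format`) and the exact checks are illustrative assumptions, not the verbatim ClickHouse code:

```cpp
// Sketch only: simplified members; the ordering and the try/catch are the point.
void PullingAsyncPipelineExecutor::cancel()
{
    try
    {
        /// May throw, e.g. NETWORK_ERROR when sendCancel() writes to a dead socket.
        if (!data->is_finished && data->executor)
            data->executor->cancel();
    }
    catch (...)
    {
        /// Log and swallow: failing to notify a remote node must not abort
        /// local cleanup, otherwise the threads below deadlock.
        tryLogCurrentException("PullingAsyncPipelineExecutor");
    }

    /// Always reached now: wakes the thread blocked in
    /// ConcurrentBoundedQueue::emplaceImpl() (Row 1 of system.stack_trace),
    /// which in turn lets joinThreads() (Row 2) and the dtor (Row 3) finish.
    if (lazy_format)
        lazy_format->cancel();
}
```

With this ordering the producer stuck in `IOutputFormat::work()` is released first, so the join performed in the destructor completes instead of waiting forever.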
alexey-milovidov approved these changes on Jan 24, 2023
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix a possible hang of a query (most likely a distributed one) when query cancellation fails with an exception (for example, when a remote node resets the connection).
Recently I saw the following: a client executed a long distributed query and then terminated the connection. In this case query cancellation is done from the PullingAsyncPipelineExecutor dtor, but during cancellation one of the nodes sent ECONNRESET, which caused an exception from PullingAsyncPipelineExecutor::cancel(). This in turn caused a deadlock in which multiple threads wait for each other, because cancel() for LazyOutputFormat was never called.

The relevant portion of the logs and the state of the threads (the `system.stack_trace` query and its output) are included in the commit message above.
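To make the deadlock concrete, here is a self-contained sketch in plain C++ (not ClickHouse code; `BoundedQueue` is a stand-in for `ConcurrentBoundedQueue<DB::Chunk>`): the producer blocks pushing into a full queue whose consumer is gone, and `join()` then waits forever unless someone cancels the queue, which is exactly what cancelling LazyOutputFormat does.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class BoundedQueue
{
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

    /// Blocks while the queue is full and not cancelled; returns false once cancelled.
    bool push(T value)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_ || cancelled_; });
        if (cancelled_)
            return false;
        queue_.push(std::move(value));
        return true;
    }

    /// The moral equivalent of cancelling LazyOutputFormat: wake all producers.
    void cancel()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        cancelled_ = true;
        not_full_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::queue<T> queue_;
    size_t capacity_;
    bool cancelled_ = false;
};

int main()
{
    BoundedQueue<int> queue(1);

    // "Row 1": the producer fills the queue and then blocks -- the consumer
    // (here, the disconnected client) never pops anything.
    std::thread producer([&] { int i = 0; while (queue.push(i++)) {} });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    // Comment out the next line and join() below hangs forever -- the exact
    // state shown in Rows 2 and 3 of system.stack_trace above.
    queue.cancel();

    producer.join();
}
```

Running it as written terminates; removing the `queue.cancel()` call reproduces the hang, which is why `cancel()` must be reached even when notifying remote nodes throws.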
Cc: @KochetovNicolai