bug: new queries cannot be submitted when one of the coordinators is disconnected in a specific case #30

hfsugar · 2021-04-28T03:56:49Z

Software Environment:

OpenLooKeng version (source or binary):
latest version
OS platform & distribution (eg., Linux Ubuntu 16.04):
linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
Java version:
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

Describe the current behavior

After submitting a big query that crashes one coordinator, new small queries cannot be submitted successfully.

Describe the expected behavior

The crash of one coordinator should not affect other new queries.

Steps to reproduce the issue

Set node-scheduler.include-coordinator=true. It is also best to set experimental.reserved-pool-enabled=false, otherwise the coordinator may not crash.
Submit a big query that crashes one coordinator directly, or run "kill -9 prestoServerPID" on the coordinator processing this query if you are sure the coordinator does not have enough memory for new queries.
Submit new queries to the other coordinators.

Related log/screenshots

lk> select 1;

Query 20210428_035158_00002_4tzge, QUEUED, 0 nodes, 0 splits

Special notes for this issue

Why can't new queries be submitted?
When a new query is submitted, group.canRunMore() always returns false,
so the new query stays queued forever.

Why does group.canRunMore() always return false?
Because the query state in Hazelcast is not updated when the big query crashes one coordinator.
The client sees the server as gone after the big query is submitted, but the query's state in Hazelcast stays running, which affects canRunMore for all other new queries.
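
To make the failure mode concrete, here is a minimal sketch of it. It is not openLooKeng code: the shared map stands in for the query states kept in Hazelcast, and the class name, the `State` enum, and the admission rule (running count below a concurrency limit) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not openLooKeng code: a plain map stands in for the
// query states that coordinators share through Hazelcast.
class StaleStateDemo {
    enum State { QUEUED, RUNNING, FINISHED, FAILED }

    // Assumed admission rule: a group may run more queries only while the
    // number of RUNNING entries is below its concurrency limit.
    static boolean canRunMore(Map<String, State> sharedStates, int concurrencyLimit) {
        long running = sharedStates.values().stream()
                .filter(state -> state == State.RUNNING)
                .count();
        return running < concurrencyLimit;
    }

    public static void main(String[] args) {
        Map<String, State> sharedStates = new HashMap<>();
        sharedStates.put("big_query", State.RUNNING);
        // The owning coordinator is killed at this point, so nothing ever
        // moves "big_query" out of RUNNING.
        System.out.println(canRunMore(sharedStates, 1)); // prints "false"
    }
}
```

With a limit of 1, the orphaned entry alone is enough to keep every later query queued on the surviving coordinators.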

haochending · 2021-04-28T20:49:14Z

I think we have a query state expiration config, and by default it's 10 seconds. Basically, if the query state of that running query hasn't been updated for more than 10 seconds, we won't count that query. But for this to work, all the coordinators' clocks need to be in sync. Can you check this?
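
A rough sketch of why the clocks matter for this check. The class and method names are made up for illustration, and only the 10-second default comes from the comment above; the real check in openLooKeng may look different.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of an expiration check; names are illustrative only.
class ExpiryCheck {
    // 10 seconds is the default mentioned in this thread.
    static final Duration EXPIRATION = Duration.ofSeconds(10);

    // A state counts as expired when it has not been updated for more than
    // EXPIRATION, as seen from the reading coordinator's clock.
    static boolean isExpired(Instant lastUpdated, Instant readerNow) {
        return Duration.between(lastUpdated, readerNow).compareTo(EXPIRATION) > 0;
    }

    public static void main(String[] args) {
        Instant written = Instant.parse("2021-04-28T03:51:58Z");
        Instant elevenSecondsLater = written.plusSeconds(11);

        // Reader clock in sync with the writer: the state is expired.
        System.out.println(isExpired(written, elevenSecondsLater)); // true

        // Reader clock 5 seconds behind: the same state looks only 6 seconds
        // old, so the dead query would still be counted as running.
        System.out.println(isExpired(written, elevenSecondsLater.minusSeconds(5))); // false
    }
}
```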

hfsugar · 2021-04-29T02:46:23Z

I have checked, and all the coordinators' clocks are already in sync. @haochending

hfsugar · 2021-04-29T08:58:58Z

I found that StateFetcher calls handleExpiredQueryState only for OOM_QUERY_STATE_COLLECTION_NAM and FINISHED_QUERY_STATE_COLLECTION_NAM. Is QUERY_STATE_COLLECTION_NAME missing? If so, running queries in QUERY_STATE_COLLECTION_NAME are never cleaned up. I think expired queries in QUERY_STATE_COLLECTION_NAME also need to be updated to failed. @haochending
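
Roughly what I have in mind, as a sketch only. This is not the actual StateFetcher code: the `Entry` class, the method name, and the in-memory map are assumptions standing in for the running-query collection.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real StateFetcher: marks running entries as
// FAILED once they have not been updated within the expiration window.
class ExpiredStateSweep {
    enum State { RUNNING, FINISHED, FAILED }

    static final class Entry {
        State state;
        final Instant lastUpdated;

        Entry(State state, Instant lastUpdated) {
            this.state = state;
            this.lastUpdated = lastUpdated;
        }
    }

    // Returns how many entries were switched from RUNNING to FAILED.
    static int failExpiredRunning(Map<String, Entry> states, Instant now, Duration expiration) {
        int changed = 0;
        for (Entry entry : states.values()) {
            boolean stale = Duration.between(entry.lastUpdated, now).compareTo(expiration) > 0;
            if (entry.state == State.RUNNING && stale) {
                entry.state = State.FAILED;
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-04-28T03:52:30Z");
        Map<String, Entry> states = new HashMap<>();
        states.put("orphaned", new Entry(State.RUNNING, now.minusSeconds(30)));
        states.put("healthy", new Entry(State.RUNNING, now.minusSeconds(2)));

        int changed = failExpiredRunning(states, now, Duration.ofSeconds(10));
        System.out.println(changed + " " + states.get("orphaned").state
                + " " + states.get("healthy").state);
    }
}
```

Only the entry that stopped being refreshed is failed; a query whose coordinator is still updating its state is left alone.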

haochending · 2021-05-06T14:22:48Z

@hfsugar I think the code has already been refactored in the latest master branch and handleExpiredQueryState is getting called whenever we are deserializing the query states, can you give that a try?

hfsugar · 2021-05-08T03:22:55Z

OK. I tried cleaning expired data in QUERY_STATE_COLLECTION_NAME and it works. Thanks!

haochending · 2021-05-28T13:26:37Z

/sync

hfsugar · 2021-06-01T13:49:08Z

This has been fixed.

hfsugar mentioned this issue May 11, 2021

fix cleanExpireState #32

Closed

hfsugar closed this as completed Jun 1, 2021
