bug: new queries cannot be submitted when one of coordinator disconnected in specific case #30

Closed
hfsugar opened this issue Apr 28, 2021 · 7 comments

@hfsugar
Contributor

hfsugar commented Apr 28, 2021

Software Environment:

  • OpenLooKeng version (source or binary):
    latest version
  • OS platform & distribution (eg., Linux Ubuntu 16.04):
    Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
  • Java version:
    java version "1.8.0_162"
    Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
    Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

Describe the current behavior

After a large query is submitted and crashes one coordinator, new small queries can no longer be submitted successfully.

Describe the expected behavior

The crash of one coordinator should not affect new queries submitted to the other coordinators.

Steps to reproduce the issue

  1. Set node-scheduler.include-coordinator=true. It is also best to set experimental.reserved-pool-enabled=false; otherwise the coordinator may not crash.
  2. Submit a large query that crashes one coordinator directly. Alternatively, if you are sure the coordinator processing this query does not have enough memory left for new queries, run "kill -9 prestoServerPID" on that coordinator.
  3. Submit new queries to the other coordinators.

Related log/screenshots

lk> select 1;

Query 20210428_035158_00002_4tzge, QUEUED, 0 nodes, 0 splits

Special notes for this issue

Why can't new queries be submitted?
When a new query is submitted, group.canRunMore() always returns false,
so the new query always stays queued.

Why does group.canRunMore() always return false?
Because the query state in Hazelcast is not updated when the large query crashes one coordinator.
The client sees the server as gone after the large query is submitted, but the query state in Hazelcast stays at running, which affects canRunMore for other new queries.
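
The effect described above can be illustrated with a minimal sketch. This is not openLooKeng's actual code: the class, the map standing in for the shared Hazelcast state, and the use of a simple concurrency limit are all assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical model of the problem; names do not come from openLooKeng.
class StaleStateDemo {
    enum State { QUEUED, RUNNING, FINISHED, FAILED }

    // Simplified stand-in for group.canRunMore(): counts RUNNING entries in the
    // shared state map and compares the count against an assumed limit.
    static boolean canRunMore(Map<String, State> sharedStates, int concurrencyLimit) {
        long running = sharedStates.values().stream()
                .filter(s -> s == State.RUNNING)
                .count();
        return running < concurrencyLimit;
    }
}
```

If the coordinator that owned a query dies without updating its entry, the entry stays RUNNING and keeps counting against the limit, so every later call returns false until something marks it as failed or removes it.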

@haochending
Contributor

I think we have a query state expiration config, and by default it is 10 seconds. Basically, if the state of that running query hasn't been updated for more than 10 seconds, we won't count that query. But for this to work, all the coordinators' clocks must be synchronized. Can you check this?
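
As a rough sketch of why such an expiration check depends on synchronized clocks (the class and constant names here are hypothetical, and only the 10-second default comes from the comment above):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical expiry check; not the project's actual implementation.
class QueryStateExpiry {
    // 10 seconds is the default mentioned in the thread; the constant name is assumed.
    static final Duration EXPIRATION = Duration.ofSeconds(10);

    // lastUpdated is stamped by the coordinator that owns the query;
    // now is read from the clock of the coordinator doing the check.
    static boolean isExpired(Instant lastUpdated, Instant now) {
        return Duration.between(lastUpdated, now).compareTo(EXPIRATION) > 0;
    }
}
```

Because the two timestamps come from different machines, a checking coordinator whose clock runs behind the owner's would see a smaller elapsed time and keep treating a dead query as live.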

@hfsugar
Contributor Author

hfsugar commented Apr 29, 2021

I have checked, and all the coordinators' clocks are already synchronized. @haochending

@hfsugar
Contributor Author

hfsugar commented Apr 29, 2021

I found that StateFetcher calls handleExpiredQueryState only for OOM_QUERY_STATE_COLLECTION_NAME and FINISHED_QUERY_STATE_COLLECTION_NAME. Is QUERY_STATE_COLLECTION_NAME missing? If so, running queries in QUERY_STATE_COLLECTION_NAME are never cleaned up. I think expired queries in QUERY_STATE_COLLECTION_NAME also need to be updated to failed. @haochending
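
The proposal above, applying expiry handling to the running-query collection as well, might look roughly like this. It is a sketch under assumptions: the Entry class, the string states, and failExpired are invented for illustration and are not StateFetcher's real API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sweep over the running-query collection; names are illustrative only.
class ExpiredRunningSweep {
    static final Duration EXPIRATION = Duration.ofSeconds(10); // assumed default

    // Minimal stand-in for one query's shared state record.
    static final class Entry {
        String state;
        final Instant lastUpdated;

        Entry(String state, Instant lastUpdated) {
            this.state = state;
            this.lastUpdated = lastUpdated;
        }
    }

    // Marks RUNNING entries not updated within EXPIRATION as FAILED.
    // Returns the number of entries that were changed.
    static int failExpired(Map<String, Entry> runningCollection, Instant now) {
        int changed = 0;
        for (Entry e : runningCollection.values()) {
            boolean stale = Duration.between(e.lastUpdated, now).compareTo(EXPIRATION) > 0;
            if (stale && "RUNNING".equals(e.state)) {
                e.state = "FAILED";
                changed++;
            }
        }
        return changed;
    }
}
```

Entries updated recently are left untouched, so queries owned by healthy coordinators keep counting toward the group's limits.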

@haochending
Contributor

@hfsugar I think the code has already been refactored in the latest master branch, and handleExpiredQueryState is now called whenever we deserialize the query states. Can you give that a try?

@hfsugar
Contributor Author

hfsugar commented May 8, 2021

OK. I tried cleaning the expired data in QUERY_STATE_COLLECTION_NAME and it works. Thanks!

@haochending
Contributor

/sync

@hfsugar
Contributor Author

hfsugar commented Jun 1, 2021

This has been fixed.

@hfsugar hfsugar closed this as completed Jun 1, 2021