
Edge prop mismatch and missing edges (including in/out edges) with TOSS flag turned on #3030

Closed
kikimo opened this issue Oct 11, 2021 · 4 comments

@kikimo
Contributor

kikimo commented Oct 11, 2021


Describe the bug (must be provided)

Edge prop mismatch and missing edges (including in/out edges) with the TOSS flag turned on.

Your Environments (must be provided)

  • OS: Linux vesoft-192-168-15-11 5.4.151-1.el7.elrepo.x86_64 #1 SMP Tue Oct 5 10:21:01 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler: g++ (Nebula Graph Build) 7.5.0
  • CPU: Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz
  • Commit id: e642c05

How To Reproduce (must be provided)

Steps to reproduce the behavior:

  1. a cluster with 5 storage + 1 meta + 1 graph
  2. change the leader every 1.5s
  3. 1024 concurrent clients, using a customized client SDK (so they can switch to the newly elected leader quickly), insert 4096 distinct edges (a rough sketch of this workload follows the list)
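
As a rough illustration of step 3, below is a minimal Go sketch of the insert workload, with a placeholder insertEdge function standing in for the customized storage thrift client used in the real test (its retry-on-leader-change logic is not shown; names are illustrative only):

```go
// Minimal sketch of the workload: numClients workers drain a shared queue of
// numEdges distinct edges, each edge inserted exactly once.
package main

import (
	"fmt"
	"sync"
)

const (
	numClients = 1024
	numEdges   = 4096
	numVerts   = 64
)

type edge struct {
	src, dst int64
	prop     int64 // value the checker later compares between the out- and in-edge records
}

// insertEdge is a hypothetical stand-in for the storage-side insert RPC used by
// the test; the real client retries against the newly elected leader.
func insertEdge(e edge) error {
	return nil
}

func main() {
	jobs := make(chan edge, numEdges)
	for i := 0; i < numEdges; i++ {
		// 64 x 64 vertex pairs give 4096 distinct edges.
		jobs <- edge{src: int64(i / numVerts), dst: int64(i % numVerts), prop: int64(i)}
	}
	close(jobs)

	var wg sync.WaitGroup
	for c := 0; c < numClients; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range jobs {
				if err := insertEdge(e); err != nil {
					fmt.Printf("insert %d->%d failed: %v\n", e.src, e.dst, err)
				}
			}
		}()
	}
	wg.Wait()
}
```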

Expected behavior

4096 edges, with the same props on each edge's in/out (forward/backward) records.

Additional context


What we found in the result check:

found 4090 forward edges
found 4090 backward edges
found 4090 edges in total
found 4090 semi-normal edges
prop mismatch in edge: 14->33, forward prop: 334->929, backward: 693->929
prop mismatch in edge: 14->26, forward prop: 243->922, backward: 1013->922
prop mismatch in edge: 13->11, forward prop: 436->843, backward: 31->843
prop mismatch in edge: 14->40, forward prop: 58->936, backward: 226->936
prop mismatch in edge: 10->61, forward prop: 64->701, backward: 990->701
prop mismatch in edge: 18->20, forward prop: 896->1172, backward: 210->1172
prop mismatch in edge: 14->35, forward prop: 694->931, backward: 558->931
prop mismatch in edge: 25->44, forward prop: 427->1644, backward: 400->1644
prop mismatch in edge: 30->27, forward prop: 541->1947, backward: 487->1947
prop mismatch in edge: 18->5, forward prop: 148->1157, backward: 75->1157
prop mismatch in edge: 14->30, forward prop: 357->926, backward: 570->926
prop mismatch in edge: 33->28, forward prop: 654->2140, backward: 978->2140
prop mismatch in edge: 14->37, forward prop: 88->933, backward: 993->933
vertexes: 64
found 6 missing edges:
0->0
0->1
0->2
14->27
14->38
21->40
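
For context, the report above comes from a checker that compares the forward (out-edge) and backward (in-edge) records of every expected edge. A rough sketch, assuming both directions have already been scanned into maps keyed by "src->dst" (names and layout are illustrative, not the actual test-suite code):

```go
// Sketch of the consistency check: compare forward and backward edge props and
// report prop mismatches and missing edges.
package main

import "fmt"

func checkEdges(forward, backward map[string]string, expected []string) {
	fmt.Printf("found %d forward edges\n", len(forward))
	fmt.Printf("found %d backward edges\n", len(backward))

	// An edge whose forward and backward props disagree is a prop mismatch.
	for key, fprop := range forward {
		if bprop, ok := backward[key]; ok && bprop != fprop {
			fmt.Printf("prop mismatch in edge: %s, forward prop: %s, backward: %s\n",
				key, fprop, bprop)
		}
	}

	// An expected edge absent from either direction counts as missing.
	var missing []string
	for _, key := range expected {
		_, inF := forward[key]
		_, inB := backward[key]
		if !inF || !inB {
			missing = append(missing, key)
		}
	}
	fmt.Printf("found %d missing edges:\n", len(missing))
	for _, key := range missing {
		fmt.Println(key)
	}
}

func main() {
	forward := map[string]string{"14->33": "334->929"}
	backward := map[string]string{"14->33": "693->929"}
	checkEdges(forward, backward, []string{"14->33", "0->0"})
}
```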
@kikimo kikimo added the type/bug Type: something is unexpected label Oct 11, 2021
@kikimo kikimo added this to the v2.6.0 milestone Oct 11, 2021
@kikimo kikimo changed the title from "Edge prop mismatch and missing edge(including in/out edges) under with toss flag turn on" to "Edge prop mismatch and missing edge(including in/out edges) with toss flag turn on" Oct 12, 2021
@liuyu85cn
Contributor

liuyu85cn commented Oct 12, 2021

This happens only if a raft partition's leader changes from Host A to Host B and then changes back to Host A.

TOSS has a constraint that whenever a raft peer is elected as leader, it scans the KV store for leftover primes. Only after that scan does it add its partition id to a TOSS whitelist.

But if the leader changes back so quickly that the partition id has not yet been erased from the whitelist of the host that lost leadership, and a new insert request comes in, it will overwrite the prime (and then erase it).

This means it may happen with the test application, because it calls the storage thrift interface directly.

But it won't happen with the normal graph client (the client needs some delay to learn the new leader, which is enough time for the prime scan to finish).
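
To make the window clearer, here is a conceptual Go sketch of the whitelist gate described above; type and method names are illustrative, not the actual storaged code:

```go
// Conceptual sketch of the TOSS whitelist gate: a partition only accepts TOSS
// requests after the freshly elected leader has re-scanned leftover primes.
package main

import (
	"fmt"
	"sync"
)

type tossWhitelist struct {
	mu    sync.Mutex
	parts map[int]bool
}

// scanPrimes stands in for the recovery scan of prime keys left by unfinished
// TOSS transactions on this partition.
func scanPrimes(part int) {}

func (w *tossWhitelist) onElectedLeader(part int) {
	scanPrimes(part) // recover leftover primes first ...
	w.mu.Lock()
	w.parts[part] = true // ... then start accepting TOSS requests
	w.mu.Unlock()
}

func (w *tossWhitelist) onLostLeadership(part int) {
	w.mu.Lock()
	delete(w.parts, part) // stop accepting until the next post-election scan
	w.mu.Unlock()
}

func (w *tossWhitelist) canProcess(part int) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.parts[part]
}

func main() {
	w := &tossWhitelist{parts: map[int]bool{}}

	// Host A is elected, scans primes, and whitelists part 1.
	w.onElectedLeader(1)

	// Leadership bounces A -> B -> A so quickly that onLostLeadership(1) has
	// not run by the time a new insert reaches A again: part 1 is still
	// whitelisted, so the insert is processed without a fresh prime scan and
	// can overwrite (and then erase) a prime left by an in-flight transaction.
	fmt.Println("insert accepted without re-scan:", w.canProcess(1))
}
```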

@kikimo
Contributor Author

kikimo commented Oct 13, 2021

This happens only if a raft partition's leader changes from Host A to Host B and then changes back to Host A.

TOSS has a constraint that whenever a raft peer is elected as leader, it scans the KV store for leftover primes. Only after that scan does it add its partition id to a TOSS whitelist.

But if the leader changes back so quickly that the partition id has not yet been erased from the whitelist of the host that lost leadership, and a new insert request comes in, it will overwrite the prime (and then erase it).

This means it may happen with the test application, because it calls the storage thrift interface directly.

But it won't happen with the normal graph client (the client needs some delay to learn the new leader, which is enough time for the prime scan to finish).

But I think this is still a problem that we should fix, because:

  1. technically it is still a bug in our code logic, and our users might hit it: you cannot really guarantee what will happen under concurrency unless the code is right in both design and implementation
  2. the current graph client is highly unsatisfying, especially when a leader change happens; I think we need to improve the SDK in the future, and that's why the test suite gives up on the graph client and calls the storage RPC directly (the storage client of the test suite was completely redesigned to mitigate the leader-change problem)

@liuyu85cn
Contributor

Agree. For now we will just write something to explain that this is not a big problem in 2.6 (because the graph client will cover it).

We are also trying to fix it in 2.6 if possible.

@liuyu85cn
Contributor

Got something. If the raft leader's replicateLog succeeds but the leader (term) then changes, it reports to the processor that this log failed (code: TERM_OUT_OF_DATE, which is transformed into E_CONSENSUS_ERROR).

But since this log has already been sent to two followers (one of which will become the new leader), the log will be committed in the end.

It looks like we need to distinguish a real TERM_OUT_OF_DATE (term changed before replication) from a TERM_CHANGED_AFTER_REPLICATE.

We don't have enough time to test this thoroughly, so please move it to 3.0.
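
To spell out the distinction, here is a small sketch of how the processor would interpret the two cases differently; TERM_CHANGED_AFTER_REPLICATE is the name proposed in this thread, not an existing code, and the rest of the names are illustrative:

```go
// Sketch of the proposed split of the raft append result codes.
package main

import "fmt"

type raftResult int

const (
	SUCCEEDED raftResult = iota
	TERM_OUT_OF_DATE             // term changed before the entry was replicated: definitely not written
	TERM_CHANGED_AFTER_REPLICATE // term changed after the entry reached followers: may still commit
)

// appendOutcome shows how the processor would have to interpret each result.
func appendOutcome(r raftResult) string {
	switch r {
	case SUCCEEDED:
		return "committed"
	case TERM_OUT_OF_DATE:
		return "failed, safe to report an error to the client"
	case TERM_CHANGED_AFTER_REPLICATE:
		return "outcome unknown: the entry may still be committed by the new leader"
	default:
		return "unexpected result"
	}
}

func main() {
	// Today both term-change cases collapse into one code that surfaces as
	// E_CONSENSUS_ERROR, which is what confuses the TOSS prime bookkeeping.
	fmt.Println(appendOutcome(TERM_OUT_OF_DATE))
	fmt.Println(appendOutcome(TERM_CHANGED_AFTER_REPLICATE))
}
```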
