
Some thoughts after running and inspecting TiDB #134

Closed · unw9527 opened this issue Jul 5, 2022 · 3 comments

unw9527 (Contributor) commented Jul 5, 2022

Writing down some thoughts after running and inspecting half of the TiDB results, so that everyone can see them.

  1. A problem we might want to solve: when we set the number of clusters to 2, we do not get a 2x speedup. In the TiDB run, one thread still had more than 300 test cases to process when the other thread had already finished (the early-finishing thread then sat idle, leaving only one thread working for two more days).
    One possible solution is to randomize the test cases so that the complexity/failure rate of the cases is similar across the two threads (see the sketch after this list). I will try to implement this before starting the next TiDB run, but I am not sure how much it will help. Any further suggestions are welcome.
  2. It seems that I cannot run TiDB with 4 clusters: 2 of the clusters never come up and eventually throw a runtime error. Right now, running with 2 clusters takes almost a week to finish all test cases. I am not sure whether this is a known limitation of our multi-cluster design. I can share the log if anyone wants to take a look.
  3. Another issue is about an implicit requirement: for creating pods, only replicas is explicitly required in the TiDB CRD, but storage is required in practice as well, even though the CRD does not say so. This causes more than 100 alarms in Acto, because Acto changes only replicas, which results in no change in the system state. TiDB also throws no error in the log indicating that storage is required. I have run some experiments with a seed that explicitly sets both storage and replicas, and under that setting the change in replicas is reflected in the system state. I remember @tylergu said before that implicit requirements are a grey zone and probably worth discussing with prof @tianyin before drawing any conclusion.
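
Regarding item 1, here is a minimal sketch of the randomization idea: shuffle the test cases first, then split them among the workers, so that slow or failing cases are spread out instead of piling up on one worker. This is illustrative only (written in Go for brevity; the function and variable names are made up, and Acto's actual partitioning code will look different):

```go
// Illustrative only; the point is just: shuffle first, then split.
package main

import (
	"fmt"
	"math/rand"
)

// partitionTestCases shuffles the test cases and then deals them out to
// numWorkers partitions round-robin, so expensive or failing cases are
// spread across workers instead of clustering on one of them.
func partitionTestCases(testCases []string, numWorkers int, seed int64) [][]string {
	shuffled := append([]string(nil), testCases...)
	r := rand.New(rand.NewSource(seed)) // fixed seed keeps a run reproducible
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	partitions := make([][]string, numWorkers)
	for i, tc := range shuffled {
		partitions[i%numWorkers] = append(partitions[i%numWorkers], tc)
	}
	return partitions
}

func main() {
	cases := []string{"tc-1", "tc-2", "tc-3", "tc-4", "tc-5", "tc-6"}
	for i, p := range partitionTestCases(cases, 2, 42) {
		fmt.Printf("worker %d: %v\n", i, p)
	}
}
```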

PS. It seems that one of TiDB's developers has replied to the PR I submitted previously. However, I am not sure whether his suggestion is correct, since it will only raise an error if MaxFailoverCount > 0. Even when MaxFailoverCount = 0, we probably want to raise an error as long as pdDeletedFailureReplicas > 0. What do you think? @tylergu

tylergu (Member) commented Jul 5, 2022

@unw9527

  1. Yeah, I think randomizing the test cases before partitioning can help with the skew among the workers. I will implement this.
  2. I think TiDB should be able to run with more than 2 workers; let me try it on riverrun tomorrow and see whether it works there. This is definitely not a limitation of the multi-cluster design, but more of a technical challenge in getting multi-cluster runs working. It is likely caused by some system configuration problem.

Another issue is about an implicit requirement: for creating pods, only replicas is explicitly required in the TiDB CRD, but storage is required in practice as well, even though the CRD does not say so.

  1. So the problem is that there is a field FIELD with two subfields, replicas and storage. The storage subfield is effectively required: without it, FIELD is invalid and is silently rejected by the operator. I think this is problematic behavior on the operator's side, and it can be fixed by simply marking storage as a required subfield of FIELD (see the sketch below). I am pretty sure the developers will happily accept a PR if we fix it for them. Let's discuss how to fix this tomorrow in person.
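
As a sketch of what "marking storage required" could look like, assuming the CRD schema is generated from Go types with kubebuilder/controller-gen markers (the type and field names below are placeholders, not the actual tidb-operator definitions; the real fix may instead live in the operator's validation code):

```go
// Hypothetical CRD type definitions, not the actual tidb-operator code.
// With controller-gen/kubebuilder, a field carrying the Required marker
// (and no `omitempty`/+optional) is emitted as required in the generated
// OpenAPI schema, so the API server rejects CRs that omit it with a clear
// error instead of the operator silently ignoring the spec.
package v1alpha1

import "k8s.io/apimachinery/pkg/api/resource"

// ComponentSpec is a placeholder for the FIELD discussed above.
type ComponentSpec struct {
	// Replicas is already marked required today.
	// +kubebuilder:validation:Required
	Replicas int32 `json:"replicas"`

	// Making Storage required surfaces the implicit requirement at
	// admission time, before the operator ever sees the CR.
	// +kubebuilder:validation:Required
	Storage resource.Quantity `json:"storage"`
}
```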

It seems that one of TiDB's developers has replied to pingcap/tidb-operator#4608, which I submitted previously. However, I am not sure whether his suggestion is correct, since it will only raise an error if MaxFailoverCount > 0.

  1. I think there is some misunderstanding going on. The variable *tc.Spec.PD.MaxFailoverCount is the specification from the CR indicating how many failovers this cluster should tolerate, and the variable pdDeletedFailureReplicas is the number of failures that have already happened. The PR addresses the case where users set *tc.Spec.PD.MaxFailoverCount to 0, meaning they want to turn off PD's failover feature. So the fix should be that when users set *tc.Spec.PD.MaxFailoverCount to 0, the error message is not thrown. I think the developer's understanding is correct: the error should only be thrown when pdDeletedFailureReplicas has reached MaxFailoverCount and MaxFailoverCount is not 0.

I agree with you that checking pdDeletedFailureReplicas > 0 has other benefits, but I don't think this error message is suitable for those cases. The error message simply indicates that the failover budget is used up, and our fix is to mute it when users explicitly turn off the failover feature (see the sketch below). Let's discuss and send a reply tomorrow.
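
A minimal sketch of the condition being described, using the variable names from the thread; the function and its signature are simplified stand-ins, not the actual tidb-operator code:

```go
// Illustrative sketch, not the actual tidb-operator code. The names
// MaxFailoverCount and pdDeletedFailureReplicas come from the discussion
// above; everything else is a simplified stand-in.
package failover

import "fmt"

// checkFailoverBudget raises the "budget exhausted" error only when failover
// is enabled (maxFailoverCount > 0) and the number of replicas that have
// already failed over has reached that budget. MaxFailoverCount == 0 means
// the user turned PD failover off, so no error is raised in that case.
func checkFailoverBudget(maxFailoverCount, pdDeletedFailureReplicas int32) error {
	if maxFailoverCount > 0 && pdDeletedFailureReplicas >= maxFailoverCount {
		return fmt.Errorf("pd failover replicas (%d) reached the limit (%d), skip failover",
			pdDeletedFailureReplicas, maxFailoverCount)
	}
	return nil
}
```

With this check, setting MaxFailoverCount to 0 quietly disables failover instead of producing a "budget exhausted" error.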

unw9527 (Contributor, Author) commented Jul 17, 2022

Update:

  1. This is solved.
  2. This is probably a hardware issue whose root cause we do not know. The current workaround is to use another machine to run TiDB with 4 clusters.
  3. The issue has been filed and a PR has been created: validate spec.pump.resources.request.storage pingcap/tidb-operator#4633

unw9527 closed this as completed Jul 17, 2022
tianyin (Member) commented Jul 17, 2022

This is probably a hardware issue

There is never a hardware issue -- only SW issues...
