Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PD sometimes fails to scale out #1592

Closed
cofyc opened this issue Jan 20, 2020 · 6 comments · Fixed by #1598
Closed

PD sometimes fails to scale out #1592

cofyc opened this issue Jan 20, 2020 · 6 comments · Fixed by #1598
Labels
type/bug Something isn't working

Comments

@cofyc
Copy link
Contributor

cofyc commented Jan 20, 2020

Bug Report

What version of Kubernetes are you using?

k8s: v1.12.10
statefulset: advanced statefulset 0.3.1
tidb version: v2.1.19

What version of TiDB Operator are you using?

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/operator_ghpr_e2e_test_kind/detail/operator_ghpr_e2e_test_kind/2450/pipeline/70

the new PD (basic-v2-pd-3) cannot start, it fails with the following error repeatedly.

root@operator-ghpr-e2e-test-kind-2450-s03dg-66xsm-2x7xx:/workspace# kubectl -n tidb-cluster-1861 logs -f basic-v2-pd-3 

Name:      basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc
Address 1: 10.244.3.40 basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc.cluster.local
nslookup domain basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=basic-v2-pd-3 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2379 --config=/etc/pd/pd.toml  --join=http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
[2020/01/20 06:47:47.906 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2020/01/20 06:47:47.906 +00:00] [INFO] [util.go:60] [PD] [release-version=v2.1.19]
[2020/01/20 06:47:47.906 +00:00] [INFO] [util.go:61] [PD] [git-hash=4a22762aeb7858baae04e8e1d596f577a8cc18a2]
[2020/01/20 06:47:47.906 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2020/01/20 06:47:47.906 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-12-27 11:31:33"]
[2020/01/20 06:47:47.906 +00:00] [INFO] [metricutil.go:80] ["disable Prometheus push client"]
[2020/01/20 06:47:47.907 +00:00] [INFO] [server.go:108] ["PD Config"] [config="{\"client-urls\":\"http://0.0.0.0:2379\",\"peer-urls\":\"http://0.0.0.0:2380\",\"advertise-client-urls\":\"http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2379\",\"advertise-peer-urls\":\"http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380\",\"name\":\"basic-v2-pd-3\",\"data-dir\":\"/var/lib/pd\",\"initial-cluster\":\"basic-v2-pd-1=http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,basic-v2-pd-2=http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,basic-v2-pd-0=http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380\",\"initial-cluster-state\":\"existing\",\"join\":\"http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380,http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380\",\"lease\":3,\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"log-rotate\":true,\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null},\"log-file\":\"\",\"log-level\":\"\",\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"basic-v2-pd-3\",\"address\":\"\",\"interval\":\"0s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":0,\"max-merge-region-keys\":0,\"split-merge-interval\":\"1h0m0s\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"region-schedule-limit\":4,\"replica-schedule-limit\":8,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":2,\"hot-region-cache-hits-threshold\":3,\"tolerant-size-ratio\":5,\"low-space-ratio\":0.8,\"high-space-ratio\":0.6,\"disable-raft-learner\":\"false\",\"disable-remove-down-replica\":\"false\",\"disable-replace-offline-replica\":\"false\",\"disable-make-up-replica\":\"false\",\"disable-remove-extra-replica\":\"false\",\"disable-location-replacement\":\"false\",\"disable-namespace-relocation\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false},{\"type\":\"hot-region\",\"args\":null,\"disable\":false},{\"type\":\"label\",\"args\":null,\"disable\":false}]},\"replication\":{\"max-replicas\":3,\"location-labels\":\"rack\"},\"namespace\":{},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"0 B\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\"},\"label-property\":null,\"WarningMsgs\":null,\"namespace-classifier\":\"table\"}"]
[2020/01/20 06:47:47.911 +00:00] [INFO] [server.go:143] ["start embed etcd"]
[2020/01/20 06:47:47.911 +00:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020/01/20 06:47:47.914 log.go:90: [info] embed: [pprof is enabled under /debug/pprof]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [name = basic-v2-pd-3]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [data dir = /var/lib/pd]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [member dir = /var/lib/pd/member]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [heartbeat = 500ms]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [election = 3000ms]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [snapshot count = 100000]
2020/01/20 06:47:47.914 log.go:90: [info] embed: [advertise client URLs = http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2379]
[2020/01/20 06:47:47.917 +00:00] [FATAL] [main.go:117] ["run server failed"] [error="couldn't find local name \"basic-v2-pd-3\" in the initial cluster configuration"] [stack="github.com/pingcap/pd/vendor/github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/release_tidb_2.1/go/src/github.com/pingcap/pd/vendor/github.com/pingcap/log/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/release_tidb_2.1/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:117\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
@cofyc cofyc changed the title PD sometimes fail to scale out PD sometimes fails to scale out Jan 20, 2020
@cofyc
Copy link
Contributor Author

cofyc commented Jan 20, 2020

# pd-ctl member 
{
  "header": {
    "cluster_id": 6783896199384141460
  },
  "members": [
    {
      "name": "basic-v2-pd-1",
      "member_id": 5179349175510344119,
      "peer_urls": [
        "http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
      ],
      "client_urls": [
        "http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2379"
      ]
    },
    {
      "member_id": 7699799069801548718,
      "peer_urls": [
        "http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
      ]
    },
    {
      "name": "basic-v2-pd-2",
      "member_id": 9154202019746378711,
      "peer_urls": [
        "http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
      ],
      "client_urls": [
        "http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2379"
      ]
    },
    {
      "name": "basic-v2-pd-0",
      "member_id": 18017601806834663438,
      "peer_urls": [
        "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
      ],
      "client_urls": [
        "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379"
      ]
    }
  ],
  "leader": {
    "name": "basic-v2-pd-0",
    "member_id": 18017601806834663438,
    "peer_urls": [
      "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
    ],
    "client_urls": [
      "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379"
    ]
  },
  "etcd_leader": {
    "name": "basic-v2-pd-0",
    "member_id": 18017601806834663438,
    "peer_urls": [
      "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
    ],
    "client_urls": [
      "http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379"
    ]
  }
}

however, basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc is already in the cluster.

@cofyc
Copy link
Contributor Author

cofyc commented Jan 20, 2020

discovery logs:

# kubectl -n tidb-cluster-1861  logs -f basic-v2-discovery-594cc4f458-jc4g8
I0120 05:41:04.639619       1 version.go:38] Welcome to TiDB Operator.
I0120 05:41:04.639644       1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.1.0-beta.1.20+bfa1983847315d", GitCommit:"bfa1983847315df80b4c3f5d88f340de16018aa2", GitTreeState:"clean", BuildDate:"2020-01-15T08:07:16Z", GoVersion:"go1.13", Compiler:"gc", Platform:"linux/amd64"}
I0120 05:41:04.643938       1 mux.go:41] starting TiDB Discovery server, listening on 0.0.0.0:10261
I0120 05:41:46.257418       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:46.282423       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.3.17:2379: connect: connection refused
E0120 05:41:46.282487       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.3.17:2379: connect: connection refused
I0120 05:41:46.291881       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:46.300555       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:46.300621       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:46.303927       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:46.307443       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:46.307480       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:47.310612       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:47.314285       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.3.17:2379: connect: connection refused
E0120 05:41:47.314316       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.3.17:2379: connect: connection refused
I0120 05:41:50.317061       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:50.320768       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:50.320794       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:51.638813       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:51.643347       1 mux.go:59] failed to discover: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:51.643370       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:52.646907       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:52.652453       1 mux.go:59] failed to discover: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.2.17:2379: connect: connection refused
E0120 05:41:52.652482       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.2.17:2379: connect: connection refused
I0120 05:41:53.325067       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:53.330584       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:53.330614       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:54.191257       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:41:54.194408       1 mux.go:66] generated args for basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
: --initial-cluster=basic-v2-pd-0=http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:41:55.333531       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:55.337195       1 mux.go:59] failed to discover: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:55.337216       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:55.655752       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:55.659247       1 mux.go:59] failed to discover: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
E0120 05:41:55.659273       1 mux.go:61] failed to writeError: Get http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members: dial tcp 10.244.1.19:2379: connect: connection refused
I0120 05:41:57.662523       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
E0120 05:41:58.266220       1 mux.go:59] failed to discover: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
, Error response 503 URL http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members,body response: no leader
E0120 05:41:58.266254       1 mux.go:61] failed to writeError: Error response 503 URL http://basic-v2-pd.tidb-cluster-1861:2379/pd/api/v1/members,body response: no leader
I0120 05:41:59.271590       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:41:59.276394       1 mux.go:66] generated args for basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
: --join=http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379
I0120 05:41:59.340134       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:41:59.343829       1 mux.go:66] generated args for basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
: --join=http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379
I0120 05:42:24.440755       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:42:24.449249       1 mux.go:66] generated args for basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
: --join=http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2379,http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379
I0120 05:44:10.316296       1 discovery.go:68] advertisePeerUrl is: basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
I0120 05:44:10.333237       1 mux.go:66] generated args for basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380
: --join=http://basic-v2-pd-1.basic-v2-pd-peer.tidb-cluster-1861.svc:2379,http://basic-v2-pd-2.basic-v2-pd-peer.tidb-cluster-1861.svc:2379,http://basic-v2-pd-0.basic-v2-pd-peer.tidb-cluster-1861.svc:2379

@cofyc
Copy link
Contributor Author

cofyc commented Jan 20, 2020

cc @weekface @Yisaer @aylei
this happens in e2e tests several times, I guess it indicates a race somewhere. It seems a bug in pd? the basic-v2-pd-3 entry does not have name field.

@Yisaer
Copy link
Contributor

Yisaer commented Jan 20, 2020

@cofyc It seems a known problem we've met before. tikv/pd#1520
After I dig into the pd code repo, I found that this error was throw by etcd.

https://github.com/pingcap/pd/blob/efdb65f727c60a38587f2127c198c15afdca194d/vendor/github.com/coreos/etcd/etcdserver/config.go#L181-L186

@cofyc cofyc added priority/important-longterm type/bug Something isn't working labels Jan 20, 2020
@cofyc
Copy link
Contributor Author

cofyc commented Jan 20, 2020

we can try to find out why this entry (output of pd-ctl member) does not have name field first:

    {
      "member_id": 7699799069801548718,
      "peer_urls": [
        "http://basic-v2-pd-3.basic-v2-pd-peer.tidb-cluster-1861.svc:2380"
      ]
    },

@cofyc
Copy link
Contributor Author

cofyc commented Jan 21, 2020

as @nolouch pointed out, this might be the same issue we reported before: tikv/pd#1643 and PD fixed in 3.x via tikv/pd#1643. However, the fix was not backported into 2.x. So we have issues in the 2.x CI job. We can wait for the PD 2.x to be fixed too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants