Regression - Text registry migration broken on master #3876

DP19 · 2023-08-15T04:42:31Z

What happened:

Updating from pre v0.12.0 to latest master causes records to be attempted to be deleted that don't exist yet.

msg="Del records: a-some-record.some-domain.com. TXT [\"heritage=external-dns,external-dns/owner=cluster-1,external-dns/resource=virtualservice/ns/vs\"] 300"
...
"googleapi: Error 404: The 'entity.change.deletions[a-some-record.some-domain.com.][TXT]' resource named 'a-some-record.some-domain.com. (TXT)' does not exist...

What you expected to happen:
The migration to the new txt records in v0.12.0 should apply without issue

How to reproduce it (as minimally and precisely as possible):
Run external dns in an environment pre v0.12.0 with --registry=txt
Run latest master with --registry=txt to see attempts to delete records that don't exist yet

Anything else we need to know?:

Admittedly I don't know if this is for all providers; but it is happening for the google provider in my environment. The names in the log output have been changed but it's the gist of what's happening.

I stepped through the code and I see that it seems to be related to the move to handle this migration via a ForceUpdate strategy in #3635

These two lines end up with the same records to change and in the txt registry code they both generate the new recordType-record TXT:
https://github.com/kubernetes-sigs/external-dns/blob/master/registry/txt.go#L277
https://github.com/kubernetes-sigs/external-dns/blob/master/registry/txt.go#L286

This ends up submitting a delete and recreate of the new TXT record when the recordType-record TXT entry doesn't exist yet.

Environment:

External-DNS version (use external-dns --version): master
DNS provider: google
Others: flags - --policy=sync --registry=txt --source=istio-virtualservice --domain-filter=some-domain.com --provider=google --txt-owner-id=cluster-1

The text was updated successfully, but these errors were encountered:

DP19 · 2023-08-16T15:27:45Z

Wrote a simple test to show the issue

func TestTXTMigration(t *testing.T) {
	ctx := context.Background()
	p := inmemory.NewInMemoryProvider()
	p.CreateZone(testZone)
	p.ApplyChanges(ctx, &plan.Changes{
		Create: []*endpoint.Endpoint{
			newEndpointWithOwner("oldformat.test-zone.example.org", "foo.loadbalancer.com", endpoint.RecordTypeCNAME, ""),
			newEndpointWithOwner("oldformat.test-zone.example.org", "\"heritage=external-dns,external-dns/owner=owner\"", endpoint.RecordTypeTXT, ""),
			newEndpointWithOwner("oldformat2.test-zone.example.org", "bar.loadbalancer.com", endpoint.RecordTypeA, ""),
			newEndpointWithOwner("oldformat2.test-zone.example.org", "\"heritage=external-dns,external-dns/owner=owner\"", endpoint.RecordTypeTXT, ""),
		},
	})
	r, _ := NewTXTRegistry(p, "", "", "owner", time.Hour, "wc", []string{endpoint.RecordTypeCNAME, endpoint.RecordTypeA, endpoint.RecordTypeNS}, false, nil)
	records, err := r.Records(ctx)
	assert.NoError(t, err)

	testPlan := plan.Plan{
		Current: records,
		Desired: []*endpoint.Endpoint{
			{
				DNSName:    "oldformat.test-zone.example.org",
				Targets:    endpoint.Targets{"foo.loadbalancer.com"},
				RecordType: endpoint.RecordTypeCNAME,
			},
			{
				DNSName:    "oldformat2.test-zone.example.org",
				Targets:    endpoint.Targets{"bar.loadbalancer.com"},
				RecordType: endpoint.RecordTypeA,
			},
		},
		Policies:       []plan.Policy{&plan.SyncPolicy{}},
		ManagedRecords: []string{"A", "AAAA", "CNAME"},
	}

	txtMigrationChanges := testPlan.Calculate()

	err = r.ApplyChanges(ctx, txtMigrationChanges.Changes)
	assert.NoError(t, err)
}

This breaks in two places where it's trying to updateOld and New but these new txt record types don't exist.

DP19 · 2023-09-29T03:08:49Z

This should no longer be an issue once #3774 is released

danil-smirnov · 2023-11-03T10:49:03Z

@DP19 We experience this issue when upgrading from 0.10.2 to 0.13.6:

{"level":"fatal","msg":"googleapi: Error 404: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist.\nMore details:\nReason: notFound, Message: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist...

Any workaround?

DP19 · 2023-11-03T11:12:12Z

@DP19 We experience this issue when upgrading from 0.10.2 to 0.13.6:

{"level":"fatal","msg":"googleapi: Error 404: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist.\nMore details:\nReason: notFound, Message: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist...

Any workaround?

@danil-smirnov yes, upgrade to 0.13.5 first and let it make the new txt records. You should be able to upgrade to the lastest version after that.

danil-smirnov · 2023-11-03T12:03:20Z

Thank you @DP19 , this helped!

However experiencing some errors in the logs now: #4018 (maybe unrelated)

DP19 · 2023-11-03T12:27:55Z

@danil-smirnov That might just be due to v0.13.5 not updating all of the txt records.

Can you run that version with the --once flag and see if there are any errors making updates for all of the records? This should only run a single sync and will make it easy to know if something is failing there

danil-smirnov · 2023-11-03T13:24:29Z

You are right @DP19 - I haven't applied a two-step upgrade to the Cloudflare instance. After I went through 0.13.5 error was fixed.

k8s-triage-robot · 2024-02-01T13:56:20Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alebedev87 · 2024-07-25T14:04:27Z

@DP19 : why the upgrade should pass specifically by 0.13.5 version? 0.13.1, 0.13.2, 0.13.4 are not good?

DP19 · 2024-07-25T14:19:57Z

@DP19 : why the upgrade should pass specifically by 0.13.5 version? 0.13.1, 0.13.2, 0.13.4 are not good?

I think they are; the migration you put in place for those versions should work without issue. I think we just picked this version because it was the last release before the change that went in and caused this issue; letting other fixes / changes also run between 0.12.0 and 0.13.5.

alebedev87 · 2024-07-26T07:40:44Z

@DP19 : Thanks for the reply!

I think we just picked this version because it was the last release before the change that went in and caused this issue

Yeah, that's what I thought too.

However, I'm still puzzled about the reasoning for not fixing the force update and waiting for the "new new" format instead.

The issue you filed here exists, needs a 2 steps migration
The new format is still there (created by the forced update)
The new new format is not merged

We could use your change until the new new format is available.

DP19 · 2024-07-26T13:34:50Z

@DP19 : Thanks for the reply!

I think we just picked this version because it was the last release before the change that went in and caused this issue

Yeah, that's what I thought too.

However, I'm still puzzled about the reasoning for not fixing the force update and waiting for the "new new" format instead.

The issue you filed here exists, needs a 2 steps migration

The new format is still there (created by the forced update)

The new new format is not merged

We could use your change until the new new format is available.

@alebedev87 I believe he thought at the time it would be merged and in by the next release; I haven't looked at the code to see if it will still work now with all of the other changes but if someone wants to take a shot at that I'd be all for it 🤷🏻

msaah-benchling · 2024-10-09T20:26:35Z

curious if the need for a two-step upgrade still exists. I've got a situation where I need to go from 0.11.x to 0.14.x or 0.15.x - is it still necessary to go through 0.13.5 in this case? thanks.

EDIT: given that #3774 isn't merged, I'm going to guess the answer is yes, a two step migration is still required?

DP19 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 15, 2023

DP19 mentioned this issue Aug 19, 2023

Change ForceUpdate strategy logic to handle recreations of new txt registry records correctly #3882

Closed

2 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024

DP19 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 4, 2024

OliverMKing mentioned this issue Apr 16, 2024

bump to external dns 0.13.5 Azure/aks-app-routing-operator#196

Merged

janavenkat mentioned this issue May 27, 2024

Unable to Upgrade to Version 0.13.5 #4506

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression - Text registry migration broken on master #3876

Regression - Text registry migration broken on master #3876

DP19 commented Aug 15, 2023 •

edited

Loading

DP19 commented Aug 16, 2023 •

edited

Loading

DP19 commented Sep 29, 2023

danil-smirnov commented Nov 3, 2023 •

edited

Loading

DP19 commented Nov 3, 2023

danil-smirnov commented Nov 3, 2023 •

edited

Loading

DP19 commented Nov 3, 2023

danil-smirnov commented Nov 3, 2023

k8s-triage-robot commented Feb 1, 2024

alebedev87 commented Jul 25, 2024

DP19 commented Jul 25, 2024

alebedev87 commented Jul 26, 2024

DP19 commented Jul 26, 2024

msaah-benchling commented Oct 9, 2024 •

edited

Loading

Regression - Text registry migration broken on master #3876

Regression - Text registry migration broken on master #3876

Comments

DP19 commented Aug 15, 2023 • edited Loading

DP19 commented Aug 16, 2023 • edited Loading

DP19 commented Sep 29, 2023

danil-smirnov commented Nov 3, 2023 • edited Loading

DP19 commented Nov 3, 2023

danil-smirnov commented Nov 3, 2023 • edited Loading

DP19 commented Nov 3, 2023

danil-smirnov commented Nov 3, 2023

k8s-triage-robot commented Feb 1, 2024

alebedev87 commented Jul 25, 2024

DP19 commented Jul 25, 2024

alebedev87 commented Jul 26, 2024

DP19 commented Jul 26, 2024

msaah-benchling commented Oct 9, 2024 • edited Loading

DP19 commented Aug 15, 2023 •

edited

Loading

DP19 commented Aug 16, 2023 •

edited

Loading

danil-smirnov commented Nov 3, 2023 •

edited

Loading

danil-smirnov commented Nov 3, 2023 •

edited

Loading

msaah-benchling commented Oct 9, 2024 •

edited

Loading