legacyLoadExtensionFieldInfo causes unbounded memory growth with dynamically created resolvers #1521

marc-chiesa · 2023-02-09T02:20:39Z

What version of protobuf and what language are you using?
Version: go 1.28.1

What did you do?
I am periodically creating a protoregistry.Types resolver with a refreshed set of FileDescriptorProto messages containing the latest proto definitions used by various services.

We use UnmarshalOptions to specify a resolver for extensions. Every unmarshal via UnmarshalOptions with a new resolver causes in-use memory to grow. The memory continues to grow each time a new resolver is used, even though previous resolvers and their contained types are no longer referenced by our code.

What did you expect to see?

Memory should eventually reach a reasonable steady-state range for a given load.

What did you see instead?

Memory continues to grow every time a new resolver is used.

This seems to be caused by this function in codec_extension.go.

The sync.map caching does not appear to be implemented correctly (nothing ever stores to legacyExtensionFieldInfoCache, instead it stores in legacyMessageTypeCache) though the presence of even the attempted caching appears to be the root of the problem and fixing the cache would not resolve our issue.

Instead, in order to support dynamically recreated resolvers without the possibility of unbounded memory growth, it would seem the following change to getExtensionFieldInfo would be adequate because it actually does less work than the current implementation but accomplishes the same thing since legacyExtensionFieldInfoCache is empty and makeExtensionFieldInfo will run every time.

func getExtensionFieldInfo(xt protoreflect.ExtensionType) *extensionFieldInfo {
	if xi, ok := xt.(*ExtensionInfo); ok {
		xi.lazyInit()
		return xi.info
	}
	return makeExtensionFieldInfo(xt.TypeDescriptor())
}

Another possibly solution would be to add an option so skip the caching mechanism, though this would also require public options in other areas (such as mergeOptions) because they use the same code path.

The text was updated successfully, but these errors were encountered:

dsnet · 2023-02-09T07:49:44Z

This is an unfortunate side-effort of heavy use of dynamic types and in many ways is no different than heavy use of Go reflection where you dynamically create new Go types (golang/go#28783) and use them with various Go stdlib packages. Many of them also use the equivalent of a sync.Map keyed on reflect.Type and exhibit the same unbounded growth problem.

Fundamentally, we need something like golang/go#43615, where a cache knows it can reclaim an object because it is the very last reference.

marc-chiesa · 2023-02-09T12:45:55Z

You're right, a weak reference would also solve the problem wonderfully. For this specific case, however, my question is does this value need to be cached at all? I could be misunderstanding the code, but to me it looks like this caching hasn't worked as intended for 3+ years. Skipping the caching altogether would presumably retain all of the performance benefits that were realized by replacing the old cache with none of the side effects related to unbounded growth.

dsnet · 2023-02-09T18:23:58Z

does this value need to be cached at all

Possibly? It's been a while since I've been regularly involved with protobuf. I wouldn't be surprised if there is code depending on *extensionFieldInfo being comparable where comparing it to another pointer is a way to check if two things are the exact same extension.

Essentially:

xi := legacyLoadExtensionFieldInfo(xt)

guarantees that xi1 == xi2 if and only if xt1 == xt2.

caching hasn't worked as intended for 3+ years

Half the caching algorithm has never worked (and we should fix that), but to my understanding it still ensures the property before. It's just that it unfortunately always does the heavy makeExtensionFieldInfo conversion every single time.

marc-chiesa · 2023-02-09T19:40:43Z

Possibly? It's been a while since I've been regularly involved with protobuf. I wouldn't be surprised if there is code depending on *extensionFieldInfo being comparable where comparing it to another pointer is a way to check if two things are the exact same extension.

That's definitely a valid use case, though I don't see any places where that is happening based on the references to getExtensionFieldInfo (and that is the only caller of legacyLoadExtensionFieldInfo). The returned *extensionFieldInfo does not seem to be used for any comparisons with other pointers. I can't find any place where the cached values are accessed from legacyMessageTypeCache outside of legacyLoadExtensionFieldInfo either, so I don't think there would be another source for such a comparison, at least as far as I can tell.

Half the caching algorithm has never worked (and we should fix that), but to my understanding it still ensures the property before. It's just that it unfortunately always does the heavy makeExtensionFieldInfo conversion every single time

You are of course correct that as currently implemented this function will return the same pointer for a given input, which is certainly an important component of caching, though if I were to guess from the commit messages I'd assume that performance was a primary criteria. In this case, all existing tests continue to pass and the current performance would be largely unaffected by removing the cache. I've just been unable to find a case where a pointer difference would affect correctness, though I may be missing something (and there's no telling whether such a requirement might materialize in the future). Admittedly, I haven't done any performance testing with the caching fixed to say whether or not there are significant gains from skipping the makeExtensionFieldInfo call.

I have some bias here due to the way the caching implementation affects our memory profile, but I can understand the reasoning for fixing the cache rather than removing it.

jhump · 2024-01-31T17:15:47Z

Instead of a user-facing option, it would be nice if the ExtensionType could simply implement an extra optional method, like IsDynamic(), and have that disable the cache.

Workloads that make use of dynamic protos should not need to be restarted periodically due to unbounded memory usage, so this seems like a serious bug.

jhump · 2024-01-31T19:48:48Z

I've just been unable to find a case where a pointer difference would affect correctness

I've done the same exercise, of examining all usages of impl.getExtensionFieldInfo and references to impl.extensionFieldInfo and cannot find any code that uses pointer equality. Also, as mentioned above, all tests pass when the cache is removed. So I suspect it's purpose is for performance (so that every time the field is serialized/de-serialized, all of the relevant metadata is memo-ized), not for canonicalization/identity.

Another way to implement the cache of computed *impl.extensionFieldInfo metadata by protoreflect.ExtensionType is to have each implementation of protoreflect.ExtensionType store the computed info -- so the identity/address of the protoreflect.ExtensionType is all that's needed. There are currently only two concrete implementations of protoreflect.ExtensionType that I can find across both v1 and v2 repos (i.e. this one and github.com/protocolbuffers/protobuf-go):

*impl.ExtensionInfo:
This one has a few aliases across the two repos:
- *proto.ExtensionDesc (in this repo)
- *protolegacy.ExtensionDesc (in the v2 repo)
- *protoimpl.ExtensionInfo (in the v2 repo)
*dynamicpb.extensionType

This first one is already special-cased and handled this way: impl.getExtensionInfo does not use the cache for this type because the impl.ExtensionInfo struct stores it in a field named info.

So instead of something like an IsDynamic() marker method to opt out of caching (which I suggested above), users could instead embed a special type (for example, a new protoimpl.ExtensionFieldInfo) to opt into caching.

The downside to such an approach (aside from one new exported type in the total exported surface area) is that existing implementations of protoreflect.ExtensionType (other than the two in these repos) would see a reduction in performance for relevant operations (serialization/de-serialization/merging/etc), since the metadata would have to be recomputed every time. But correctness would be unaffected.

But the performance impact could be easily fixed by having them embed the new type into their implementation.

Here's some pseudo-code to riff on this idea:

// In the internal/impl package
//------------------------------

type ExtensionFieldInfo struct {
	init sync.Once
	info *extensionFieldInfo
}

func (e *ExtensionFieldInfo) getExtensionInfo(xt protoreflect.ExtensionType) *extensionFieldInfo {
	e.init.Do(func() {
		e.info = makeExtensionFieldInfo(xt.TypeDescriptor())
	})
	return e.info
}

type extensionFieldInfoHolder interface {
	getExtensionInfo(xt protoreflect.ExtensionType) *extensionFieldInfo
}

func legacyLoadExtensionFieldInfo(xt protoreflect.ExtensionType) *extensionFieldInfo {
	// !! No more legacyExtensionFieldInfoCache !!
	if holder, ok := xt.(extensionFieldInfoHolder); ok {
		return holder.getExtensionInfo(xt)
	}
	// Slow path for things that don't opt into embedding info
	return makeExtensionFieldInfo(xt.TypeDescriptor())
}

// In the protoimpl package
//------------------------------

// An exported alias for it the type above.
type ExtensionFieldInfo = impl.ExtensionFieldInfo

// In the dynamicpb package
//------------------------------

// This demonstrates how easy it is to use
type extensionType struct {
	desc extensionTypeDescriptor
	
	// cache extension metadata for performance
	protoimpl.ExtensionFieldInfo
	// just embed this type and that's it!
}

jhump · 2024-02-01T00:52:58Z

I didn't originally grok everything that @marc-chiesa was saying about the cache being broken. But now I understand.

Possibly? It's been a while since I've been regularly involved with protobuf. I wouldn't be surprised if there is code depending on *extensionFieldInfo being comparable where comparing it to another pointer is a way to check if two things are the exact same extension.

This is not possible. The bug described, that suggests caching has not worked for 3 years (now 4 since it's a year later), means that every time legacyLoadExtensionFieldInfo is called, it in fact always returns a new *extensionFieldInfo.

Essentially:
xi := legacyLoadExtensionFieldInfo(xt)
guarantees that xi1 == xi2 if and only if xt1 == xt2.

This is incorrect. The caching bug (which has existed since this code was first introduced) means that it always returns a new *extensionFieldInfo, even if supplied the same xt more than once.

So, because of the caching bug, it is clear that none of this is necessary since it doesn't actually do anything useful today. It merely functions to pin unused objects to the heap.

As far as stable identity is concerned, the dynamicpb tests verify something similar, but not quite the same. Essentially:

xt := dynamicpb.NewExtensionType(extDesc)

guarantees that xt1 == xt2 if and only if extDesc1 == extDesc2. Furthermore, the TypeDescriptor() method of the dynamic extension type provides a similar property:

xtd := xt.TypeDescriptor()

guarantees that xtd1 == xtd2 if and only if xt1 == xt2.

@dsnet, is it possible that is what you were thinking of?

That is also what prevents my suggestion above from working -- we can't embed anything mutable into dynamicpb.extensionType because then its methods would need pointer receivers. And that would require something like a cache (which means that every dynamic extension type would be immortal and lead to unbounded memory usage).

So I take back the whole proposal above. Clearly the simplest solution here is to just remove the legacyExtensionFieldInfoCache, which will have zero impact on any behavior or performance compared to the state of the main branch today.

jhump · 2024-02-01T01:21:14Z

I've proposed what I think is the safest and most sane solution here: https://go-review.googlesource.com/c/protobuf/+/560095

Resolves golang/protobuf#1521 In the internal/impl package, a helper computes metadata used for serialization and deserialization of extension values. It features a package var of type sync.Map that is used as cache. This ostensibly was for performance, however it has never worked, because the code that updates the cache inserts entries into the wrong map. (These erroneous entries do not cause any issues because they are keys that never conflict with those used in valid queries.) Instead of a one-line fix to have this code update the correct cache, this change removes the cache altogether. The existence of the cache means that once a protoreflect.ExtensionType is used to serialize or deserialize a message, it can *never* be garbage collected. Workloads that are long-lived servers using dynamic messages and extensions based on user requests will exhibit unbounded growth in memory usage as the cache only gets larger and larger. Since the cache has never worked, any advantages it ostensibly conferred have not been missed. So this fixes the unbounded memory growth instead. Change-Id: I15957fd8521852f9f7f9f89db7ebfd7170d85202

neild · 2024-02-01T15:58:12Z

tl;dr: I agree.

The offending function is:

func legacyLoadExtensionFieldInfo(xt protoreflect.ExtensionType) *extensionFieldInfo {
        if xi, ok := legacyExtensionFieldInfoCache.Load(xt); ok {
                return xi.(*extensionFieldInfo)
        }
        e := makeExtensionFieldInfo(xt.TypeDescriptor())
        if e, ok := legacyMessageTypeCache.LoadOrStore(xt, e); ok {
                return e.(*extensionFieldInfo)
        }
        return e
}

This function contains a bug: It's clearly supposed to add a new *extensionFieldInfo to legacyExtensionFieldInfoCache if one isn't already present, but it's adding it to legacyMessageTypeCache instead. As a result, the function is effectively just:

func legacyLoadExtensionFieldInfo(xt protoreflect.ExtensionType) *extensionFieldInfo {
        return makeExtensionFieldInfo(xt.TypeDescriptor())
}

In addition, while I might be missing something, I think this bug may not be user-visible. I think *extensionFieldInfo is purely used internally and is never contained in any exported types. As such, the intended caching here is purely for performance, not correctness. Since we haven't observed any performance issues stemming from the failure to cache the *extensionFieldInfo in this path, and have observed a memory issues, then dropping the cache entirely seems reasonable. (Probably with a comment referencing this issue, and a suggestion that we might want to add the cache back in if/when the stdlib ever acquires a weak map or equivalent.)

puellanivis · 2024-02-02T03:21:36Z

The unfortunate nature of the any type for key in sync.Map… if we had had some sort of type safety on the arguments, the bug never would have compiled. :(

gopherbot closed this as completed in protocolbuffers/protobuf-go@82c6b3a Feb 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

legacyLoadExtensionFieldInfo causes unbounded memory growth with dynamically created resolvers #1521

legacyLoadExtensionFieldInfo causes unbounded memory growth with dynamically created resolvers #1521

marc-chiesa commented Feb 9, 2023 •

edited

Loading

dsnet commented Feb 9, 2023

marc-chiesa commented Feb 9, 2023 •

edited

Loading

dsnet commented Feb 9, 2023 •

edited

Loading

marc-chiesa commented Feb 9, 2023 •

edited

Loading

jhump commented Jan 31, 2024

jhump commented Jan 31, 2024

jhump commented Feb 1, 2024

jhump commented Feb 1, 2024

neild commented Feb 1, 2024

puellanivis commented Feb 2, 2024

legacyLoadExtensionFieldInfo causes unbounded memory growth with dynamically created resolvers #1521

legacyLoadExtensionFieldInfo causes unbounded memory growth with dynamically created resolvers #1521

Comments

marc-chiesa commented Feb 9, 2023 • edited Loading

dsnet commented Feb 9, 2023

marc-chiesa commented Feb 9, 2023 • edited Loading

dsnet commented Feb 9, 2023 • edited Loading

marc-chiesa commented Feb 9, 2023 • edited Loading

jhump commented Jan 31, 2024

jhump commented Jan 31, 2024

jhump commented Feb 1, 2024

jhump commented Feb 1, 2024

neild commented Feb 1, 2024

puellanivis commented Feb 2, 2024

marc-chiesa commented Feb 9, 2023 •

edited

Loading

marc-chiesa commented Feb 9, 2023 •

edited

Loading

dsnet commented Feb 9, 2023 •

edited

Loading

marc-chiesa commented Feb 9, 2023 •

edited

Loading