
Cache DuplicateNameChecker in the OutputContext #2253

Closed

Conversation

KenitoInc
Contributor

@KenitoInc KenitoInc commented Nov 18, 2021

Issues

This pull request fixes #2180.

Description

When serializing a response, we call [Serializer].CreateDuplicatePropertyNameChecker() multiple times, creating a new DuplicatePropertyNameChecker instance each time. This causes many allocations from initializing and resizing the internal propertyState dictionary.

In this PR, we create an instance of the DuplicatePropertyNameChecker when we create the OutputContext. This allows us to re-use the DuplicatePropertyNameChecker throughout the response serialization process.
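The caching pattern described above can be sketched as follows. This is a minimal Java analogue of the C# change, not the actual OData code: `OutputContext`, `validatePropertyUniqueness`, and the reset-on-fetch behavior are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for DuplicatePropertyNameChecker: tracks property names seen
// in the current scope via an internal dictionary.
class DuplicatePropertyNameChecker {
    private final Map<String, Boolean> propertyState = new HashMap<>();

    // Returns false if the property name was already written in this scope.
    boolean validatePropertyUniqueness(String name) {
        return propertyState.putIfAbsent(name, Boolean.TRUE) == null;
    }

    void reset() {
        propertyState.clear();
    }
}

// Stand-in for the output context: the checker is allocated once per
// context and handed out (after a reset) instead of being re-created
// on every CreateDuplicatePropertyNameChecker() call.
class OutputContext {
    private final DuplicatePropertyNameChecker checker = new DuplicatePropertyNameChecker();

    DuplicatePropertyNameChecker getDuplicatePropertyNameChecker() {
        checker.reset(); // caller always sees a clean checker, no new allocation
        return checker;
    }
}
```

The internal dictionary also keeps its capacity across uses, so after the first few resources it should rarely need to resize.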

Checklist (Uncheck if it is not completed)

  • Test cases added
  • Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

@habbes
Contributor

habbes commented Nov 19, 2021

Seems like some tests are failing due to duplicate names? Maybe there are situations where you don't want to share the same property checker? My assumption is that this may be due to nested properties (navigation or complex): maybe the same property name exists on the parent entity as well as in a nested object, so a different property checker should be used.

Maybe we can make the property checker hierarchical. Instead of creating a new checker every time we enter a scope, or having a single checker for the entire response, we can create a new checker per scope but cache it, so that the next time we enter the scope for the same nested property we re-use the same checker?

@joaocpaiva had also given an object pool as another suggestion.
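The object-pool alternative could look roughly like this. This is a hypothetical Java sketch: `CheckerPool`, `rent`, and `release` are invented names, and a plain `Set<String>` stands in for DuplicatePropertyNameChecker. Each nesting scope rents a checker and returns it on exit, so allocations are bounded by the maximum nesting depth rather than by the total number of objects written.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative object pool: scopes borrow a checker and give it back
// when they close, instead of allocating a fresh one per scope.
class CheckerPool {
    private final Deque<Set<String>> pool = new ArrayDeque<>();

    Set<String> rent() {
        Set<String> checker = pool.poll();
        return checker != null ? checker : new HashSet<>();
    }

    void release(Set<String> checker) {
        checker.clear(); // always return a clean checker to the pool
        pool.push(checker);
    }
}
```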

@KenitoInc KenitoInc force-pushed the fix/perf-DuplicatePropertyNameCheck branch from 18448b8 to f97e47c on November 22, 2021 11:08
@joaocpaiva
Contributor


Seems like Kennedy was just missing a Reset(). Obviously, to reuse a collection, it is important to call Clear() every time we enter a new scope that needs an empty state. Reusing a single collection per operation should go a long way in reducing allocations for this stack.

@KenitoInc
Contributor Author


I updated the code and called Reset() to fix the issue.

@pull-request-quantifier-deprecated

This PR has 32 quantified lines of changes. In general, a change size of up to 200 lines is ideal for the best PR experience!


Quantification details

Label      : Extra Small
Size       : +19 -13
Percentile : 12.8%

Total files changed: 8

Change summary by file extension:
.cs : +19 -13

Change counts above are quantified counts, based on the PullRequestQuantifier customizations.

Why proper sizing of changes matters

Optimal pull request sizes drive a better, more predictable PR flow as they strike a
balance between PR complexity and PR review overhead. PRs within the
optimal size (typically small or medium sized PRs) mean:

  • Fast and predictable releases to production:
    • Optimal size changes are more likely to be reviewed faster with fewer
      iterations.
    • Similarity in low PR complexity drives similar review times.
  • Review quality is likely higher as complexity is lower:
    • Bugs are more likely to be detected.
    • Code inconsistencies are more likely to be detected.
  • Knowledge sharing is improved within the participants:
    • Small portions can be assimilated better.
  • Better engineering practices are exercised:
    • Solving big problems by dividing them in well contained, smaller problems.
    • Exercising separation of concerns within the code changes.

What can I do to optimize my changes

  • Use the PullRequestQuantifier to quantify your PR accurately
    • Create a context profile for your repo using the context generator
    • Exclude files that are not necessary to be reviewed or do not increase the review complexity. Example: Autogenerated code, docs, project IDE setting files, binaries, etc. Check out the Excluded section from your prquantifier.yaml context profile.
    • Understand your typical change complexity, drive towards the desired complexity by adjusting the label mapping in your prquantifier.yaml context profile.
    • Only use the labels that matter to you, see context specification to customize your prquantifier.yaml context profile.
  • Change your engineering behaviors
    • For PRs that fall outside of the desired spectrum, review the details and check if:
      • Your PR could be split in smaller, self-contained PRs instead
      • Your PR only solves one particular issue. (For example, don't refactor and code new features in the same PR).

How to interpret the change counts in git diff output

  • One line was added: +1 -0
  • One line was deleted: +0 -1
  • One line was modified: +1 -1 (git diff doesn't know about modified, it will
    interpret that line like one addition plus one deletion)
  • Change percentiles: Change characteristics (addition, deletion, modification)
    of this PR in relation to all other PRs within the repository.



@KenitoInc KenitoInc marked this pull request as ready for review November 23, 2021 09:06
@KenitoInc KenitoInc requested review from odero, gathogojr, habbes, mikepizzo and xuzhg and removed request for odero November 23, 2021 09:07
@KenitoInc KenitoInc added the Ready for review label Nov 23, 2021
@@ -394,7 +398,7 @@ await this.WriteInstanceAnnotationNameAsync(propertyName, annotationName)
 await this.valueSerializer.WriteResourceValueAsync(resourceValue,
     expectedType,
     treatLikeOpenProperty,
-    this.valueSerializer.CreateDuplicatePropertyNameChecker()).ConfigureAwait(false);
+    this.valueSerializer.JsonLightOutputContext.DuplicatePropertyNameChecker).ConfigureAwait(false);
Member
At a different level, could it be a "concurrency" problem?

I mean, when we write a payload with multiple "levels", for example Top resource, Child resource, GrandChild resource...

Is the PropertyNameChecker reused at each "level"? If yes, it's a big problem if the Top resource has the same property name as the GrandChild resource.

@habbes
Contributor

habbes commented Nov 24, 2021

@joaocpaiva @KenitoInc I still think using a shared DuplicateNameChecker when writing a resource with nested resources might be problematic even if Reset() is used. I think it would only be safe to share one instance per nesting level.

Consider this example:

{
   "Id": 1,
   "Foo": "Bar",
   "Nested": {
       "Fizz": "Buzz",
       "Foo": "Bar"
    },
   "Id": 2
}

If we use the same duplicate checker without resetting, then it will incorrectly flag the top-level Foo property and the Nested.Foo property as duplicates, even though they're not duplicates.

However, if we call Reset() on the instance before entering the Nested property and/or after leaving it, then the dictionary will be empty and it will not detect the second Id field as a duplicate, potentially serializing invalid JSON.

Since the tests are all passing, maybe we need more tests to cover these scenarios, or maybe my assumptions about how the duplicate checker is used are wrong.
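The two failure modes above can be sketched with a plain set standing in for the duplicate checker (a hypothetical Java demo of the payload shown, not the OData code):

```java
import java.util.HashSet;
import java.util.Set;

// Simulates writing the payload above with one shared checker.
// A name is a duplicate if add() returns false.
class SharedCheckerDemo {
    // Sharing without Reset(): Nested.Foo is wrongly flagged as a duplicate.
    static boolean falsePositive() {
        Set<String> shared = new HashSet<>();
        shared.add("Id");
        shared.add("Foo");
        // entering "Nested" without resetting:
        return !shared.add("Foo"); // true: incorrectly reported as a duplicate
    }

    // Sharing with Reset(): the second top-level Id slips through.
    static boolean missedDuplicate() {
        Set<String> shared = new HashSet<>();
        shared.add("Id");
        shared.add("Foo");
        shared.clear(); // Reset() on entering "Nested"
        shared.add("Fizz");
        shared.add("Foo");
        shared.clear(); // Reset() on leaving "Nested"
        return shared.add("Id"); // true: real duplicate goes undetected
    }
}
```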

@joaocpaiva
Contributor


Makes sense @habbes @KenitoInc. We should make sure there is a test for that use case. Yet another alternative would be to check all top-level properties before processing the nested properties, so we could reset it safely at the end of every level?

@mikepizzo
Member

Checking top level properties before writing nested properties isn't really an option, as the writer supports streaming the values so the service doesn't have to keep the entire response object in memory.



@habbes
Contributor

habbes commented Nov 24, 2021

@joaocpaiva @mikepizzo if checking all top-level properties before nested properties is not an option, maybe we could still make significant gains from creating only one duplicate checker per nesting level if we have a response with a lot of entities with nested properties.

For example, assuming we're writing a response with 10 entities like:

[
{
    "Prop1": "value",
    "Nested1": {
         "N1Prop": "value"
     },
     "Prop2": "value",
     "Nested2": {
         "N2Prop": "value",
         "N2Nested": { ... }
      }
}
...
]

In the current implementation, I think we'll allocate 1 checker for the collection + (1 for Nested1 property + 1 for Nested2 + 1 for N2Nested) * 10 = 31 instances.

But we can safely create one instance for Nested1 and reuse it for Nested2 (after resetting), since they're at the same level and therefore do not overlap. And we can create another instance for N2Nested. We can cache these 3 instances and reuse them for all entities in the response. So in total we'll have 4 instances (including the one for the top-level collection) instead of 31, scaling only with the depth of the response rather than with the length times the number of nested properties in the response.

Also after processing an entity in the response, each dictionary would have grown to fit the largest number of properties at its nesting level, which means (I think) the dictionaries will probably not be resized after the first entity is written.

I assume this is the same improvement we'd get if we used an object pool?
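The per-nesting-level idea might be sketched like this (hypothetical Java, with a `Set<String>` standing in for the checker; `PerDepthCheckerCache` and `enterScope` are invented names). Scopes at the same depth never overlap in time, so one checker per depth is safe as long as it is cleared on each new scope:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One cached checker per nesting depth: allocations scale with the
// maximum depth of the payload, not with the number of entities.
class PerDepthCheckerCache {
    private final List<Set<String>> checkersByDepth = new ArrayList<>();

    Set<String> enterScope(int depth) {
        while (checkersByDepth.size() <= depth) {
            checkersByDepth.add(new HashSet<>());
        }
        Set<String> checker = checkersByDepth.get(depth);
        checker.clear(); // fresh state for the new scope at this depth
        return checker;
    }

    int allocatedCheckers() {
        return checkersByDepth.size();
    }
}
```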


@mikepizzo mikepizzo left a comment


Current code seems to change semantics of duplicate name checking. Perhaps there's a better way to improve perf without caching/resetting.

@@ -120,7 +120,10 @@ public void ValidatePropertyOpenForAssociationLink(string propertyName)
 /// </summary>
 public void Reset()
 {
-    propertyState.Clear();
+    if (propertyState.Count > 0)
Contributor

I'm not sure if there was another reason for adding this if statement, but Dictionary<TKey, TValue>.Clear already performs an identical count check as an optimization.

@corranrogue9
Contributor

Is there any way to write a test to ensure that the caching is working correctly?

@KenitoInc
Contributor Author

Created a new PR #2328 with a different implementation.

@KenitoInc KenitoInc closed this Mar 3, 2022
Labels
Extra Small, Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance - DuplicatePropertyNameCheck costing 1.5% of allocations
6 participants