
Cache DuplicateNameChecker in the OutputContext #2253

Closed

Conversation

KenitoInc
Contributor

@KenitoInc KenitoInc commented Nov 18, 2021

Issues

This pull request fixes #2180.

Description

When serializing a response, we call [Serializer].CreateDuplicatePropertyNameChecker() multiple times, creating a new DuplicatePropertyNameChecker instance each time. This causes many allocations from initializing and resizing the internal propertyState dictionary.

In this PR, we create an instance of the DuplicatePropertyNameChecker when we create the OutputContext. This allows us to re-use the DuplicatePropertyNameChecker throughout the response serialization process.
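The caching pattern described above can be sketched as follows. This is a minimal Java analogue of the C# change, not the actual OData code: `OutputContext`, `validatePropertyUniqueness`, and the reset-on-fetch behavior are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for DuplicatePropertyNameChecker: tracks property names seen
// in the current scope via an internal dictionary.
class DuplicatePropertyNameChecker {
    private final Map<String, Boolean> propertyState = new HashMap<>();

    // Returns false if the property name was already written in this scope.
    boolean validatePropertyUniqueness(String name) {
        return propertyState.putIfAbsent(name, Boolean.TRUE) == null;
    }

    void reset() {
        propertyState.clear();
    }
}

// Stand-in for the output context: the checker is allocated once per
// context and handed out (after a reset) instead of being re-created
// on every CreateDuplicatePropertyNameChecker() call.
class OutputContext {
    private final DuplicatePropertyNameChecker checker = new DuplicatePropertyNameChecker();

    DuplicatePropertyNameChecker getDuplicatePropertyNameChecker() {
        checker.reset(); // caller always sees a clean checker, no new allocation
        return checker;
    }
}
```

The internal dictionary also keeps its capacity across uses, so after the first few resources it should rarely need to resize.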

Checklist (Uncheck if it is not completed)

  • Test cases added
  • Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

@habbes
Contributor

habbes commented Nov 19, 2021

Seems like some tests are failing due to duplicate names? Maybe there are situations where you don't want to share the same property checker? My assumption is that this may be due to nested properties (navigation or complex): maybe the same property name exists on the parent entity as well as in a nested object, so a different property checker should be used.

Maybe we can make the property checker hierarchical. Instead of creating a new checker every time we enter a scope, or having a single checker for the entire response, we can create a new checker per scope but cache it, so that the next time we enter the scope for the same nested property we re-use the same checker?

@joaocpaiva had also given an object pool as another suggestion.
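The object-pool alternative could look roughly like this. This is a hypothetical Java sketch: `CheckerPool`, `rent`, and `release` are invented names, and a plain `Set<String>` stands in for DuplicatePropertyNameChecker. Each nesting scope rents a checker and returns it on exit, so allocations are bounded by the maximum nesting depth rather than by the total number of objects written.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative object pool: scopes borrow a checker and give it back
// when they close, instead of allocating a fresh one per scope.
class CheckerPool {
    private final Deque<Set<String>> pool = new ArrayDeque<>();

    Set<String> rent() {
        Set<String> checker = pool.poll();
        return checker != null ? checker : new HashSet<>();
    }

    void release(Set<String> checker) {
        checker.clear(); // always return a clean checker to the pool
        pool.push(checker);
    }
}
```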

@KenitoInc KenitoInc force-pushed the fix/perf-DuplicatePropertyNameCheck branch from 18448b8 to f97e47c on November 22, 2021 11:08
@joaocpaiva
Contributor


Seems like Kennedy was just missing a Reset(). Obviously, to reuse a collection, it is important to call Clear() every time we enter a new scope that needs an empty state. Reusing a single collection per operation should go a long way in reducing allocations for this stack.

@KenitoInc
Contributor Author


I updated the code and called Reset() to fix the issue.

@pull-request-quantifier-deprecated

This PR has 32 quantified lines of changes. In general, a change size of up to 200 lines is ideal for the best PR experience!


Quantification details

Label      : Extra Small
Size       : +19 -13
Percentile : 12.8%

Total files changed: 8

Change summary by file extension:
.cs : +19 -13

Change counts above are quantified counts, based on the PullRequestQuantifier customizations.

Why proper sizing of changes matters

Optimal pull request sizes drive a better, more predictable PR flow as they strike a
balance between PR complexity and PR review overhead. PRs within the
optimal size (typically small or medium sized PRs) mean:

  • Fast and predictable releases to production:
    • Optimal size changes are more likely to be reviewed faster with fewer
      iterations.
    • Similarity in low PR complexity drives similar review times.
  • Review quality is likely higher as complexity is lower:
    • Bugs are more likely to be detected.
    • Code inconsistencies are more likely to be detected.
  • Knowledge sharing is improved within the participants:
    • Small portions can be assimilated better.
  • Better engineering practices are exercised:
    • Solving big problems by dividing them in well contained, smaller problems.
    • Exercising separation of concerns within the code changes.

What can I do to optimize my changes

  • Use the PullRequestQuantifier to quantify your PR accurately
    • Create a context profile for your repo using the context generator
    • Exclude files that are not necessary to be reviewed or do not increase the review complexity. Example: Autogenerated code, docs, project IDE setting files, binaries, etc. Check out the Excluded section from your prquantifier.yaml context profile.
    • Understand your typical change complexity, drive towards the desired complexity by adjusting the label mapping in your prquantifier.yaml context profile.
    • Only use the labels that matter to you, see context specification to customize your prquantifier.yaml context profile.
  • Change your engineering behaviors
    • For PRs that fall outside of the desired spectrum, review the details and check if:
      • Your PR could be split in smaller, self-contained PRs instead
      • Your PR only solves one particular issue. (For example, don't refactor and code new features in the same PR).

How to interpret the change counts in git diff output

  • One line was added: +1 -0
  • One line was deleted: +0 -1
  • One line was modified: +1 -1 (git diff doesn't know about modified, it will
    interpret that line like one addition plus one deletion)
  • Change percentiles: Change characteristics (addition, deletion, modification)
    of this PR in relation to all other PRs within the repository.



@KenitoInc KenitoInc marked this pull request as ready for review November 23, 2021 09:06
@KenitoInc KenitoInc requested review from odero, gathogojr, habbes, mikepizzo and xuzhg and removed request for odero November 23, 2021 09:07
@KenitoInc KenitoInc added the Ready for review label Nov 23, 2021
@@ -394,7 +398,7 @@ await this.WriteInstanceAnnotationNameAsync(propertyName, annotationName)
 await this.valueSerializer.WriteResourceValueAsync(resourceValue,
     expectedType,
     treatLikeOpenProperty,
-    this.valueSerializer.CreateDuplicatePropertyNameChecker()).ConfigureAwait(false);
+    this.valueSerializer.JsonLightOutputContext.DuplicatePropertyNameChecker).ConfigureAwait(false);
Member
At a different level, could it be a "concurrency" problem?

I mean, when we write a payload with multiple "levels", for example Top resource, Child resource, GrandChild resource...

Is the PropertyNameChecker reused at each "level"? If yes, it's a big problem if the Top resource has the same property name as the GrandChild resource.

@habbes
Contributor

habbes commented Nov 24, 2021

@joaocpaiva @KenitoInc I still think using a shared DuplicateNameChecker when writing a resource with nested resources might be problematic even if Reset() is used. I think it would only be safe to share one instance per nesting level.

Consider this example:

{
   "Id": 1,
   "Foo": "Bar",
   "Nested": {
       "Fizz": "Buzz",
       "Foo": "Bar"
    },
   "Id": 2
}

If we use the same duplicate checker without resetting, then it will incorrectly flag the top-level Foo property and the Nested.Foo property as duplicates, even though they're not duplicates.

However, if we call Reset() on the instance before entering the Nested property and/or after leaving it, then the dictionary will be empty and it will not detect the second Id field as a duplicate, potentially serializing invalid JSON.

Since the tests are all passing, maybe we need more tests to cover these scenarios, or maybe my assumptions about how the duplicate checker is used are wrong.
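The two failure modes above can be sketched with a plain set standing in for the duplicate checker (a hypothetical Java demo of the payload shown, not the OData code):

```java
import java.util.HashSet;
import java.util.Set;

// Simulates writing the payload above with one shared checker.
// A name is a duplicate if add() returns false.
class SharedCheckerDemo {
    // Sharing without Reset(): Nested.Foo is wrongly flagged as a duplicate.
    static boolean falsePositive() {
        Set<String> shared = new HashSet<>();
        shared.add("Id");
        shared.add("Foo");
        // entering "Nested" without resetting:
        return !shared.add("Foo"); // true: incorrectly reported as a duplicate
    }

    // Sharing with Reset(): the second top-level Id slips through.
    static boolean missedDuplicate() {
        Set<String> shared = new HashSet<>();
        shared.add("Id");
        shared.add("Foo");
        shared.clear(); // Reset() on entering "Nested"
        shared.add("Fizz");
        shared.add("Foo");
        shared.clear(); // Reset() on leaving "Nested"
        return shared.add("Id"); // true: real duplicate goes undetected
    }
}
```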

@joaocpaiva
Contributor


Makes sense @habbes @KenitoInc. We should make sure there is a test for that use case. Yet another alternative would be to check all top-level properties before processing the nested properties, so we could reset it safely at the end of every level?

@mikepizzo
Member

Checking top level properties before writing nested properties isn't really an option, as the writer supports streaming the values so the service doesn't have to keep the entire response object in memory.



@habbes
Contributor

habbes commented Nov 24, 2021

@joaocpaiva @mikepizzo if checking all top-level properties before nested properties is not an option, maybe we could still make significant gains from creating only one duplicate checker per nesting level if we have a response with a lot of entities with nested properties.

For example, assuming we're writing a response with 10 entities like:

[
{
    "Prop1": "value",
    "Nested1": {
         "N1Prop": "value"
     },
     "Prop2": "value",
     "Nested2": {
         "N2Prop": "value",
         "N2Nested": { ... }
      }
}
...
]

In the current implementation, I think we'll allocate 1 checker for the collection + (1 for Nested1 property + 1 for Nested2 + 1 for N2Nested) * 10 = 31 instances.

But we can safely create one instance for Nested1 and reuse it for Nested2 (after resetting), since they're at the same level and therefore do not overlap. And we can create another instance for N2Nested. We can cache these 3 instances and reuse them for all entities in the response. So in total we'll have 4 instances (including the one for the top-level collection) instead of 31, scaling only with the depth of the response rather than with the length times the number of nested properties in the response.

Also after processing an entity in the response, each dictionary would have grown to fit the largest number of properties at its nesting level, which means (I think) the dictionaries will probably not be resized after the first entity is written.

I assume this is the same improvement we'd get if we used an object pool?
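The per-nesting-level idea might be sketched like this (hypothetical Java, with a `Set<String>` standing in for the checker; `PerDepthCheckerCache` and `enterScope` are invented names). Scopes at the same depth never overlap in time, so one checker per depth is safe as long as it is cleared on each new scope:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One cached checker per nesting depth: allocations scale with the
// maximum depth of the payload, not with the number of entities.
class PerDepthCheckerCache {
    private final List<Set<String>> checkersByDepth = new ArrayList<>();

    Set<String> enterScope(int depth) {
        while (checkersByDepth.size() <= depth) {
            checkersByDepth.add(new HashSet<>());
        }
        Set<String> checker = checkersByDepth.get(depth);
        checker.clear(); // fresh state for the new scope at this depth
        return checker;
    }

    int allocatedCheckers() {
        return checkersByDepth.size();
    }
}
```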


@mikepizzo mikepizzo left a comment


Current code seems to change semantics of duplicate name checking. Perhaps there's a better way to improve perf without caching/resetting.

@@ -120,7 +120,10 @@ public void ValidatePropertyOpenForAssociationLink(string propertyName)
 /// </summary>
 public void Reset()
 {
-    propertyState.Clear();
+    if (propertyState.Count > 0)
Contributor

I'm not sure if there was another reason for adding this if statement, but Dictionary<TKey, TValue>.Clear already performs an identical count check as an optimization.

@corranrogue9
Contributor

Is there any way to write a test to ensure that the caching is working correctly?

@KenitoInc
Contributor Author

Created a new PR #2328 with a different implementation.

@KenitoInc KenitoInc closed this Mar 3, 2022
Labels
Extra Small, Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance - DuplicatePropertyNameCheck costing 1.5% of allocations
6 participants